Academic Awards 2024 booklet

Structured state space models for de novo drug design

When looking for candidates for therapeutic drugs, the vast chemical space of an estimated 10^60 drug-like molecules needs to be explored. One way to do this is to design new molecules with desired properties from scratch, a process called de novo drug design. Deep learning models for de novo drug design make it possible to explore this enormous chemical space efficiently. To use these models, molecules can be represented as text, for example as Simplified Molecular Input Line Entry System (SMILES) strings (see Figure 1a). This means that language models designed for natural language processing can also learn this chemical language and generate new molecules.

In my thesis, I adapted the structured state-space sequence (S4) model, originally proposed by Gu et al., so that it could be trained on SMILES strings and generate them. This model architecture has a dual structure (see Figure 1b): S4 is formulated as a global convolution during training, processing the entire input sequence at once, and as a recurrence during generation, producing SMILES strings efficiently element by element. I showed that S4 models are capable of generating valid, unique and novel molecules (see Table 1), which shows the potential of S4 for de novo drug design.

Figure 1: (a) An example of how a molecule can be represented with text using Simplified Molecular Input Line Entry System (SMILES) strings. (b) The two different representations of the structured state-space sequence model: as a convolution during training and as a recurrence during generation.

[Figure 1, panel a: a molecule drawn next to its SMILES string, CC(Cc1ccc(C(C(O)=O)C)cc1)C. Panel b, "Dual formulation": S4 training shown as a convolution y = K * u with K = f(A, B, C), mapping the input "<BEG> C=C(CN)c1ccc(Cl)cc1" to the output "C=C(CN)c1ccc(Cl)cc1 <END>"; S4 generation shown as a recurrence over inputs u_k, states x_k and outputs y_k, connected through A, B and C.]

Table 1: After generating around 10,000 SMILES strings with the three types of structured state-space sequence (S4) models, the generations were evaluated on three metrics: 1) validity, i.e. the frequency of generated SMILES strings that represent chemically valid molecules, 2) uniqueness, i.e. the frequency of valid molecules that are structurally unique among the generations, and 3) novelty, i.e. the frequency of valid, unique generations not present in the training data set.

Model          Validity   Uniqueness   Novelty
S4             23.1%      99.1%        90.7%
S4-mini        22.2%      99.0%        92.8%
S4-extramini   2.8%       79.1%        69.1%
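
The dual structure described above can be illustrated with a toy example. The following is a minimal sketch, not the S4 model from the thesis: it assumes a small discrete linear state-space model with a diagonal A (stored as a list of its diagonal entries) and leaves out the parameterisation and discretisation that S4 uses. The function names and numbers are made up for illustration. It only shows that the convolution view y = K * u and the recurrence view over x_k give the same outputs.

```python
# Toy discrete linear state-space model (NOT the full S4 parameterisation).
# Assumption: A is diagonal, so A, B and C are plain lists of equal length.

def ssm_kernel(A, B, C, length):
    """Kernel K = f(A, B, C), with K[j] = sum_n C[n] * A[n]**j * B[n]."""
    return [sum(c * (a ** j) * b for a, b, c in zip(A, B, C))
            for j in range(length)]

def run_as_convolution(A, B, C, u):
    """Training view: y = K * u, a causal convolution over the whole input."""
    K = ssm_kernel(A, B, C, len(u))
    return [sum(K[j] * u[k - j] for j in range(k + 1))
            for k in range(len(u))]

def run_as_recurrence(A, B, C, u):
    """Generation view: x_k = A x_{k-1} + B u_k and y_k = C x_k, step by step."""
    x = [0.0] * len(A)  # initial state x_{-1} = 0
    ys = []
    for u_k in u:
        x = [a * x_n + b * u_k for a, x_n, b in zip(A, x, B)]
        ys.append(sum(c * x_n for c, x_n in zip(C, x)))
    return ys

# Illustrative values only.
A = [0.9, 0.5, -0.3]
B = [1.0, 0.5, 2.0]
C = [0.2, -1.0, 0.7]
u = [1.0, 0.0, -2.0, 3.0, 0.5]

y_conv = run_as_convolution(A, B, C, u)
y_rec = run_as_recurrence(A, B, C, u)

# Largest difference between the two views; zero up to floating-point rounding.
max_gap = max(abs(p - q) for p, q in zip(y_conv, y_rec))
print(max_gap)
```

The convolution needs the whole input sequence up front, which suits training on complete SMILES strings. The recurrence only needs the previous state and the current input, which suits generating one element at a time.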
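
The three metrics in Table 1 can be sketched in a few lines. This is a hedged illustration, not the evaluation code used in the thesis. The denominators are assumptions (validity over all generations, uniqueness over valid generations, novelty over unique valid generations), and `toy_canonicalize` is a hypothetical stand-in that only checks for balanced parentheses. A real evaluation would use a cheminformatics toolkit, for example RDKit, to check validity and to canonicalise SMILES strings.

```python
def toy_canonicalize(smiles):
    """Hypothetical stand-in for a real validity check and canonicaliser.

    Returns the string unchanged if it is non-empty and its parentheses are
    balanced, else None. This is NOT a chemical validity check.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None

def generation_metrics(generated, training_set, canonicalize):
    """Return (validity, uniqueness, novelty) as fractions between 0 and 1.

    Assumed denominators: all generations, valid generations, and unique
    valid generations, respectively.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    train = {canonicalize(s) for s in training_set} - {None}
    novel = unique - train
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Illustrative strings only, not data from the thesis.
generated = ["CCO", "CCO", "C(C", "c1ccccc1", "CC(=O)O"]
training_set = ["CCO"]

validity, uniqueness, novelty = generation_metrics(
    generated, training_set, toy_canonicalize
)
print(validity, uniqueness, novelty)
```

Passing the canonicaliser in as a parameter keeps the metric logic independent of the toolkit used to parse SMILES strings.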