MACHINE LEARNING STRATEGIES FOR DRUG DISCOVERY AND DEVELOPMENT
Muskan Verma, Shiv Hardenia * and Dinesh Kumar Jain
IPS Academy College of Pharmacy, Knowledge Village, Rajendra Nagar, A.B. Road, Indore, Madhya Pradesh, India
ABSTRACT: This review paper provides a comprehensive overview of the role of machine learning (ML) in drug discovery and development within the pharmaceutical industry. It begins by outlining the foundational concepts of machine learning, highlighting its ability to enhance decision-making and improve accuracy through data analysis. The paper emphasizes the growing adoption of ML techniques across the pharmaceutical sector, showcasing their potential to streamline drug discovery processes, reduce costs, and minimize reliance on animal testing. It categorizes various machine learning methods, such as supervised, unsupervised, semi-supervised, and reinforcement learning, and discusses their applications in drug discovery, including predicting drug efficacy, optimizing lead compounds, and validating safety biomarkers. Furthermore, the paper delves into advanced ML techniques like transfer learning, multitask learning, and active learning, which address challenges related to data scarcity and enhance model performance. The discussion also covers specific algorithms such as Support Vector Machines, Decision Trees, and Artificial Neural Networks, illustrating their utility in predicting biological properties and improving drug design. Ultimately, the paper concludes that the integration of machine learning in drug discovery promises to enhance efficiency and accuracy and heralds a new era of innovation in pharmaceutical research and development.
Keywords: Machine Learning, Drug Discovery and Development, Advanced techniques, Pharmaceutical industry
INTRODUCTION: Machine learning (ML) is a branch of artificial intelligence that allows computers to learn from data and improve with experience. It focuses on enabling computers to adjust their behaviour to enhance performance and achieve greater accuracy, where accuracy is evaluated by how often the selected actions lead to correct outcomes 1. Machine learning is a multi-faceted tool with numerous research fields supporting its development, as shown in Fig. 1.
Background: The modern origin of machine learning is traced to psychologist Frank Rosenblatt at Cornell University, who, inspired by the human nervous system, developed the perceptron in the late 1950s. This machine, designed to recognize letters of the alphabet, used both analogue and discrete signals and featured a threshold element to convert analogue data into discrete outputs. It became the prototype for modern artificial neural networks (ANNs).
Rosenblatt also conducted the first mathematical studies of perceptron learning (1959). The Novikoff theorem (1962) later formalized conditions for ensuring the perceptron’s learning algorithm converges in a finite number of steps 2. The evolution of Machine Learning (ML) is characterized by significant achievements that have greatly influenced its development. In 1950, Alan Turing's "Computing Machinery and Intelligence" paper introduced the idea of a "universal machine" that could display intelligent behaviour 3. In 1956, the Dartmouth Conference established artificial intelligence as a distinct field, gathering prominent researchers to explore AI studies and outline its objectives 4. In 1957, Frank Rosenblatt introduced the Perceptron, which was the first single-layer neural network that could learn via a method known as "perceptron learning" 5. In 1957, the General Problem Solver (GPS) program, created by Newell and Simon, showcased a method of problem-solving that utilized symbolic reasoning 6.
In 1976, the MYCIN system created by Edward Shortliffe showcased the use of expert systems for medical diagnosis by employing a rule-based methodology 7. In 1981, Gerald DeJong introduced Explanation-Based Learning (EBL), which allowed the computer to examine training data and formulate rules for eliminating irrelevant information 8. In 1985, Terry Sejnowski created NETtalk, a system that learned to pronounce English words similarly to how children do 9. In 1988, Paul Werbos introduced backpropagation, an algorithm that enabled the efficient training of multi-layer neural networks, laying the foundation for contemporary deep learning 10, 11. In 1997, IBM's Deep Blue triumphed over world chess champion Garry Kasparov, signifying a significant achievement for artificial intelligence in strategic games and showcasing its ability to tackle combinatorial challenges 12. In 1998, Yann LeCun and his team developed LeNet-5, showcasing the effectiveness of convolutional neural networks (CNNs) in image recognition and establishing a basis for contemporary computer vision 13. In 2009, the ImageNet initiative, spearheaded by Fei-Fei Li, unveiled a vast labelled dataset and benchmarks for deep learning, significantly propelling advancements in computer vision, especially through deep convolutional networks 14. In 2011, IBM Watson triumphed in the Jeopardy! game show, showcasing its abilities in natural language processing and large-scale knowledge retrieval 15.
In 2012, the AlexNet deep learning architecture, created by Krizhevsky, Sutskever, and Hinton, transformed the field of image classification by showcasing the capabilities of GPUs for training deep neural networks 16. In 2016, AlphaGo, created by DeepMind, triumphed over world Go champion Lee Sedol, demonstrating the progress in reinforcement learning and the capability of AI to master intricate strategies 17. In 2017, AlphaGo Zero eclipsed its earlier version by mastering the game of Go entirely through reinforcement learning, relying solely on its own experiences rather than any guidance from human experts 18. In 2017, the Transformer model, presented by Vaswani and colleagues, transformed the landscape of natural language processing, especially in sequence-to-sequence tasks, achieving better results than earlier approaches 19.
FIG. 1: MULTIPLE FACETS OF MACHINE LEARNING
In 2018, Devlin and colleagues developed BERT (Bidirectional Encoder Representations from Transformers), which achieved remarkable outcomes in handling various natural language processing tasks, such as question-answering and sentiment analysis 20. In 2019, DeepMind's AlphaStar surpassed human players in StarCraft II, highlighting the capabilities of reinforcement learning in intricate real-time strategy games 21. In 2021, AlphaFold triumphed in the CASP competition by accurately predicting the three-dimensional structure of proteins 22. In 2023, the FDA published two documents outlining its efforts to promote the use of AI and machine learning to enhance drug discovery, noting that over 100 regulatory submissions have utilized AI/ML methods so far. Additionally, the FDA Guidance was made available 23. The utilization of machine learning is increasingly prevalent across various sectors of the pharmaceutical industry, leading to overall improvements in the field. The effectiveness of machine learning is reflected in the rising number of companies that have adopted it within their business strategies. Large pharmaceutical organizations are also investigating machine-learning techniques for drug research and development. The possibilities offered by machine learning and its importance in drug discovery are substantial; thus, incorporating it into future advancements in this domain is crucial. One aim is to combine machine learning with high-throughput screening methods to reduce both the costs and labor associated with drug discovery. In the long run, machine learning could significantly cut down, if not eliminate, the necessity for testing on live animals. These investigations suggest that machine learning is an incredibly valuable tool in the area of drug discovery. To enhance and improve machine learning technologies for drug discovery and development, it is critical to consider various factors related to chemical and biological data.
This information would aid in developing more advanced and accurate systems by utilizing insights derived from the data. To gather this information, medicinal parameters such as cellular toxicity, variability in cellular structure, the efficacy of animal models, on-target activity, pharmacokinetic indicators, microsomal stability, and cytochrome P450 (CYP) inhibition values must be evaluated through assays 24.
FIG. 2: DIFFERENT AREAS IN DRUG DISCOVERY UTILIZING MACHINE LEARNING
In this review, we concentrate mainly on the characteristics of ML methods that are suitable for drug discovery and development. Recently, several factors have driven increasing excitement about applying machine learning techniques within the pharmaceutical sector. Fig. 2 shows the different areas of drug discovery in which machine learning is being applied. The stages are executed as a pipeline that carries a therapeutic idea forward. The individual phases differ in the time and financial investment they require, and every phase is conducted to demonstrate the efficacy of the therapeutic intervention. Medical data are mined and assessed with the help of various ‘omics’ and ‘smart automation’ tools. The effectiveness of machine learning techniques has been evaluated for establishing new drugs within the pharmaceutical sector. When paired with virtually unlimited storage capacity, enhancements in dataset attributes such as size and variety can offer a foundation for machine learning. This approach enables access to vast amounts of data from pharmaceutical companies. The data types can vary in configuration, including text, images, assay results, biometrics, and high-dimensional omics data 25.
Machine Learning and Types: In the near future, humans might find it difficult to think independently of machine learning. Machine learning (ML) is becoming increasingly significant every day due to its efficiency. For instance, those who have watched videos on YouTube may have noticed that more relevant videos are recommended to them. This phenomenon is a result of machine learning, as backend algorithms examine user browsing data to determine preferences and then display videos that align with users' interests. Machine learning understands how to handle data effectively when it receives proper training 26.
Machine learning techniques are a subset of artificial intelligence that employs specific algorithms to enable computers to "learn" and conduct important decision-making processes based on representative data samples. They are particularly beneficial when the necessary software to accomplish a specific task or achieve a desired objective is lacking or impractical 27. Machine learning provides several significant advantages, such as the capability to identify trends and patterns through the analysis of extensive datasets (for instance, Amazon's product recommendations). It facilitates automation without the need for human supervision, allowing it to make predictions and enhance its performance independently (like antivirus programs or spam filters). As it analyzes more data, machine learning consistently improves its precision and effectiveness, similar to weather forecasting. It also handles complex and varied data, adjusting to dynamic and unpredictable circumstances. Ultimately, machine learning has widespread applications across different sectors, including e-commerce and healthcare, where it customizes user experiences, optimizes processes, and forecasts outcomes 28.
Machine learning has various disadvantages, including the requirement for large amounts of high-quality data during the training phase, which can be a slow and laborious process to obtain. Training and enhancing ML models is a resource-demanding endeavour that necessitates considerable computing power and can take a long time, often resulting in high costs. Moreover, interpreting the outcomes of ML models can be difficult, as it typically demands specialized knowledge, and choosing the appropriate algorithm is essential for achieving accuracy. ML models are also susceptible to mistakes, particularly when trained on biased or incomplete data, and these errors can remain undetected for extended periods. The performance of a model is directly influenced by both the quality and quantity of data, and the presence of inaccurate or biased information can lead to erroneous conclusions. Finally, ongoing supervision is required, as models need continual monitoring and retraining when new data comes in, which can be both labor-intensive and costly 28.
Types of Machine Learning: ML algorithms can generally be divided into four main categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, as illustrated in Fig. 3.
Supervised Learning: Supervised learning consists of training a machine learning model to understand a function that relates inputs to outputs using a collection of labeled input-output pairs. It employs a collection of annotated samples to create a mapping function.
This method is usually employed when particular goals must be accomplished from a defined array of inputs, rendering it a task-oriented technique. Typical supervised tasks involve "classification," which sorts data into categories, and "regression," which aligns data with a continuous scale. An instance of supervised learning is forecasting the category or sentiment of a text, like a tweet or a product review 29.
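The two supervised tasks named above can be illustrated with a minimal sketch in plain Python. The dose and response values, the binary labels, and the function names `fit_line` and `fit_threshold` are hypothetical and serve only as an illustration; they are not taken from the cited work.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_threshold(xs, labels):
    """Classification: choose the cut-off t so that 'x > t' predicts label 1
    with the fewest errors on the labeled examples."""
    points = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(
        candidates,
        key=lambda t: sum((x > t) != bool(lab) for x, lab in zip(xs, labels)),
    )


# Hypothetical labeled input-output pairs (toy values).
doses = [1.0, 2.0, 3.0, 4.0, 5.0]
responses = [3.0, 5.0, 7.0, 9.0, 11.0]   # continuous output: regression
is_active = [0, 0, 0, 1, 1]              # categorical output: classification

slope, intercept = fit_line(doses, responses)
cutoff = fit_threshold(doses, is_active)
print(slope, intercept, cutoff)
```

In both cases the labeled pairs alone determine the mapping; the only difference is whether the output is a continuous value or a category.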
FIG. 3: VARIOUS TYPES OF MACHINE LEARNING
Unsupervised Learning: Unsupervised learning focuses on training algorithms to identify patterns in unlabelled data through methods such as clustering and association. It groups data according to common characteristics and is frequently used in fields such as social network analysis, market segmentation, and DNA classification. Although it can reveal hidden insights and identify fraud, it requires human interpretation and might yield unclear or less precise outcomes. In digital marketing, it is utilized for analyzing customers and personalizing services in platforms such as Salesforce 30, 31.
Semi-Supervised Learning: Semi-supervised learning is a combination of supervised and unsupervised methods, utilizing both labeled and unlabeled data. The primary aim of a semi-supervised learning model is to achieve more accurate predictions than those made using only the labeled data. This approach is commonly applied in areas such as machine translation, fraud detection, data labeling, and text classification 32.
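Clustering, one of the unsupervised methods mentioned above, can be sketched with a small k-means routine. This is a minimal illustration under stated assumptions: the two-dimensional points are invented, the starting centroids are passed in by hand, and `kmeans` is a hypothetical helper rather than a library function.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    def nearest(p):
        return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]          # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p) for p in points]
    return centroids, labels


# Hypothetical unlabeled data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids, labels)
```

No labels are supplied at any point; the grouping emerges only from the distances between points, which is why the resulting clusters still need human interpretation.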
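One common way to combine labeled and unlabeled data is self-training, sketched below with a nearest-centroid classifier on a single descriptor. Self-training is only one of several semi-supervised strategies and is used here as an assumption; the data values, the `margin` confidence rule, and the function names are hypothetical.

```python
def class_centroids(labeled):
    """Mean descriptor value per class."""
    sums = {}
    for x, lab in labeled:
        total, count = sums.get(lab, (0.0, 0))
        sums[lab] = (total + x, count + 1)
    return {lab: total / count for lab, (total, count) in sums.items()}


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Repeatedly pseudo-label the unlabeled points the current model is
    confident about, then refit the centroids on the enlarged labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = class_centroids(labeled)
        confident, remaining = [], []
        for x in pool:
            ranked = sorted(cents, key=lambda lab: abs(x - cents[lab]))
            best, runner_up = ranked[0], ranked[1]
            # Confident only if the nearest centroid is clearly closer.
            if abs(x - cents[runner_up]) - abs(x - cents[best]) >= margin:
                confident.append((x, best))
            else:
                remaining.append(x)
        if not confident:
            break
        labeled.extend(confident)
        pool = remaining
    return class_centroids(labeled), pool


# Two labeled examples and seven unlabeled ones (toy values).
labeled = [(1.0, "inactive"), (9.0, "active")]
unlabeled = [1.5, 2.0, 2.5, 8.0, 8.5, 7.5, 5.0]
centroids, still_unlabeled = self_train(labeled, unlabeled)
print(centroids, still_unlabeled)
```

The ambiguous point halfway between the classes is deliberately left unlabeled, while the confident pseudo-labels shift the class centroids away from the two original labeled examples.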
Reinforcement Learning: Reinforcement learning (RL) integrates concepts from cybernetics, psychology, and computer science, allowing agents to acquire knowledge via trial-and-error experiences in various environments. It centers on two primary methods: searching the space of behaviors (for example, with genetic algorithms) and applying statistical techniques to estimate the utility of actions. Important topics include exploration versus exploitation, delayed rewards, model-free methods such as TD(λ) and Q-learning, generalization, and partial observability. Open issues include the practical application of RL, its associated challenges, and unresolved questions for future research 33.
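Q-learning, one of the model-free methods named above, can be sketched on a toy problem with a delayed reward. Everything in the example is an assumption made for illustration: the five-state chain environment, the reward of 1 at the last state, the purely random behavior policy, and the hyperparameters.

```python
import random

N_STATES = 5        # states 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right


def step(state, action):
    """Deterministic chain: reward 1.0 only on arrival at the last state."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):                  # cap on episode length
            a = rng.randrange(2)              # explore uniformly at random
            nxt, reward, done = step(state, ACTIONS[a])
            # Off-policy target: best value of the next state, not the action taken.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


q = q_learning()
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # action index per non-terminal state
```

Because Q-learning is off-policy, the agent can explore at random and still recover the greedy policy afterwards; the discount factor `gamma` is what propagates the delayed reward back to earlier states.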
Drug Discovery and Development: The process of drug discovery is intricate and varied, focused on finding novel chemical compounds that can successfully treat or control diseases. Scientists usually discover new medications by obtaining a new understanding of disease processes, enabling them to create treatments that can combat or stop the advancement of the disease 34. The discovery process includes finding possible drug candidates, creating them, analyzing their characteristics, and performing tests for therapeutic effectiveness. If a compound shows positive outcomes in these phases, it advances to the drug development stage, encompassing clinical trials. The whole procedure is both lengthy and expensive, typically lasting 12-15 years and needing an expenditure of $900 million to $2 billion for each successful medication. This elevated expense signifies the numerous unsuccessful efforts, since generally, only one medication from 5,000 to 10,000 compounds tends to receive approval 35. The hurdles of drug discovery require significant resources, elite scientific expertise, cutting-edge technology, and efficient project management, with achievement also relying on perseverance and a touch of fortune. In the end, effective drug discovery provides hope and comfort to countless patients around the globe 36.
The life cycle of drug development is shown in Fig. 4.
Stages Involved in Drug Discovery and Development:
- Target Discovery
- Target Validation
- Hit Discovery
- Hit to Lead
- Lead Optimization
- Pre-Clinical studies
- Clinical Trials
- Post Development Monitoring & Pharmacovigilance
Drug Discovery: The method through which a new medication reaches the market is known by several terms, the most frequent being the development chain or "pipeline." Numerous estimates exist regarding the expenses associated with each phase of the pipeline. Most indicate that clinical trials represent the costliest phase, comprising at least 40% of total expenses 37. The drug discovery process typically spans approximately 15 years, with the notable aspect being that it is finalized for only a limited number of drug candidates, as numerous parameters must be satisfied to progress through every stage.
FIG. 4: LIFE CYCLE OF DRUG DISCOVERY AND DEVELOPMENT
Target Discovery: Target discovery is a crucial first phase in drug development, emphasizing the discovery and validation of proteins that can be targeted for treating diseases, especially chronic conditions. Two primary strategies are employed: the molecular method, which utilizes genomics and proteomics to reveal genes and proteins linked to diseases, and the systems method, which analyzes diseases at the organismal level, incorporating clinical and in-vivo information. Both approaches are improved through technologies such as siRNA, knockout models, and genetic research to confirm targets. Although the molecular approach is increasingly important, the systems approach is still essential for comprehending complex diseases and enhancing drug development effectiveness 38.
Target Validation: Target validation consists of verifying that a prospective drug target has biological significance, is associated with the disease, and can be influenced by drugs to generate therapeutic outcomes. Validating a target without conducting clinical trials is difficult, but different methods, from in-vitro tests to whole animal research, are utilized based on the organization’s strategy and resources. Effective validation frequently necessitates employing a mix of techniques, including antisense agents, ribozymes, RNA interference, gene knockouts, and small molecules. Every method comes with distinct benefits and limitations, emphasizing the necessity for a varied range of instruments to ensure dependable target validation 39.
Hit to Lead Selection: Hit-to-lead selection in drug discovery starts with identifying small molecules that can modulate a target effectively. This can be done using various methods, depending on available target information.
Techniques like mutagenesis, NMR, and X-ray crystallography require detailed knowledge, while random screening methods (e.g., biophysical testing) do not. More targeted approaches, such as chemogenomics and HTS combined with computational chemistry, help refine searches for potential leads. As the number of drug targets increases, strategies to improve efficiency and reduce costs in lead generation will be crucial, with access to high-quality screening libraries being key to success 40.
Lead Optimization: Lead optimization is a crucial stage in drug discovery, where initial lead compounds found through high-throughput screening are improved to boost their drug-like properties. This procedure includes repeated adjustments of chemical structures to enhance target binding, along with their pharmacokinetic, pharmacodynamic, and toxicity characteristics. The primary objectives are to enhance specificity, selectivity, bioavailability, and stability, all while lowering toxicity. High-throughput screens for drug metabolism and pharmacokinetics (DMPK) are crucial for forecasting in-vivo performance. Methods like mass spectrometry, MALDI imaging, and NMR Fragment-based Screening (FBS) are essential in assessing metabolites and refining lead compounds for further advancement 41.
Drug Development: After the discovery of a new chemical entity, it must undergo the development procedure. Chemical synthesis is primarily carried out in the research and development departments of pharmaceutical companies, where structure-activity relationships (SAR) are established. The examination of the potential compound can be separated into two phases: preclinical pharmacology (animal experimentation) and clinical pharmacology (human experimentation) 42.
Preclinical Studies: Preclinical studies are designed to offer insights into the safety and effectiveness of a drug candidate prior to human testing. Additionally, they can demonstrate proof of the compound's biological impact and typically feature both in-vitro and in-vivo research. Preclinical research must adhere to the regulations outlined by Good Laboratory Practice to provide dependable outcomes and is mandated by organizations like the FDA prior to submitting for IND approval. Understanding the compound's dosage and toxicity is crucial to assess if it is appropriate and sufficiently safe to advance to clinical trials, which is informed by research on pharmacokinetics, pharmacodynamics, and toxicology 43.
Clinical Studies: A clinical trial is a prospective investigation that evaluates the effects of interventions by contrasting them with a control group among human subjects. Participants are monitored from a defined starting point, unlike retrospective studies. These trials include proactive interventions, such as diagnostic, preventive, or therapeutic treatments, to assess their impacts. A control group is used to attribute outcome differences to the intervention. Research often assesses novel treatments in comparison to existing standards or a placebo when no standard treatment is available. Safety, obtaining informed consent, and adhering to protocols are vital. While randomized, double-blind studies are ideal, compromises may occur, highlighting the necessity of a robust study design to tackle clinical questions 44.
Phases of Clinical Trials: Clinical trials are generally categorized into four phases (I-IV), with each phase concentrating on various elements of drug development and evaluation. Phase I trials evaluate a drug's safety, tolerability, pharmacokinetics, and pharmacodynamics, typically involving a limited number of healthy participants. Phase II studies assess the drug's efficacy, ideal dosage, and adverse effects in a broader patient population, frequently utilizing biomarkers to evaluate results. Phase III trials are extensive, randomized studies that validate the drug's clinical effectiveness and safety, positioning it as a treatment choice for the wider population. Phase IV trials take place after approval to track long-term effects, uncover rare side effects, and evaluate the drug’s efficacy in wider or specific populations. Although these phases are separate, certain studies might intersect, merging features from different phases to tackle a range of research inquiries 44.
Application of Machine Learning Methods in Drug Discovery and Development: Machine Learning (ML) methods have greatly helped with progress in drug development, resulting in notable enhancements in the pharmaceutical sector. Using different ML techniques in drug discovery has shown great advantages for pharmaceutical companies. ML algorithms are used to create models that forecast the chemical, biological, and physical characteristics of compounds used in drug development. As drug development advances, the importance of these trained models will continue to grow. The pharmaceutical industry is utilizing machine learning techniques in various aspects such as determining drug effectiveness, forecasting drug-protein interactions, validating safety biomarkers, and improving the bioactivity of molecules. In the industry, ML methods like Support Vector Machines (SVM), Random Forest (RF), and Naïve Bayes (NB) have become increasingly popular 45.
FIG. 5: MACHINE LEARNING METHODS USED IN STAGES OF DRUG DISCOVERY AND DEVELOPMENT
There are several machine learning algorithms, and among them, some of the most commonly used approaches in drug discovery and development include:
Support Vector Machine: Support Vector Machines (SVMs) are powerful machine learning algorithms used in drug discovery and development to classify and predict various compound characteristics, such as biological activity, toxicity, and pharmacological properties. SVMs work by transforming molecular data into a high-dimensional space and finding an optimal hyperplane that separates different classes, often using kernel functions like linear, polynomial, or radial basis functions.
In drug discovery, SVMs are applied to prioritize compounds in virtual screening, predict properties like solubility and bioavailability, and identify structural features that impact activity. Combining SVMs with other methods, such as molecular docking and QSAR modeling, enhances predictive accuracy and aids in identifying promising drug candidates. Recent advancements in SVMs, including the development of new kernel functions (e.g., graph and pharmacophore kernels), allow for better handling of complex molecular structures and more precise protein-ligand interaction predictions 46.
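The idea of a separating hyperplane can be sketched with a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a simplification made for illustration: it covers only the linear kernel, whereas polynomial, radial basis, graph, or pharmacophore kernels are normally handled by dedicated libraries. The two-descriptor data, the labels (+1 active, -1 inactive), and the hyperparameters are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b))) by sub-gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]       # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                    # hinge loss is active for this point
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable descriptor vectors.
X = [(2.0, 2.5), (3.0, 2.0), (2.5, 3.5), (-2.0, -2.5), (-3.0, -2.0), (-2.5, -3.5)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Only points with a margin below 1 contribute to the update, which mirrors the role of support vectors: compounds far from the decision boundary do not influence the final hyperplane.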
Decision Trees: Decision Trees (DTs) are machine learning algorithms that classify data based on rules derived from molecular features. They create a hierarchical structure where data is split into subsets through test conditions, ultimately classifying compounds by reaching leaf nodes. In drug discovery, DTs are used to classify compounds based on biological activity, drug-likeness, and toxicity, and to predict important pharmacokinetic properties like absorption, distribution, metabolism, excretion (ADME), and toxicity (Tox). DTs are also useful for designing combinatorial libraries and handling large datasets from high-throughput or virtual screening. While DTs are interpretable and aid in understanding molecular influences on drug properties, they can be prone to overfitting and instability with small datasets. Combining DTs into ensembles, such as Random Forests, can improve their robustness and accuracy in drug discovery 47.
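The hierarchical splitting described above can be sketched as a small recursive tree builder that uses Gini impurity to choose each test condition. This is a minimal sketch, not a production implementation: the descriptors (molecular weight, logP), the drug-likeness labels, and the helper names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (weighted impurity, feature index, threshold) of the best test."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    split = None if len(set(y)) == 1 or depth == max_depth else best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]          # leaf: majority class
    _, f, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the test conditions from the root until a leaf is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical compounds: (molecular weight, logP) -> 1 if labeled drug-like.
X = [(320, 2.1), (450, 3.5), (480, 4.8), (610, 3.0), (550, 2.5), (400, 6.2), (350, 5.9), (700, 7.0)]
y = [1, 1, 1, 0, 0, 0, 0, 0]
tree = build_tree(X, y)
print(tree)
print(predict(tree, (300, 2.0)), predict(tree, (650, 2.0)))
```

The printed tree is a nested set of readable rules, which is the interpretability advantage noted above; with so few compounds, a slightly different sample could produce different thresholds, which illustrates the instability on small datasets.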
Naïve Bayesian Classifier: Naïve Bayesian Classifiers are machine learning models that classify data based on Bayes' theorem, assuming feature independence given the class label. Despite this simplification, they are highly effective in drug discovery for predicting biological properties, such as toxicity, protein target binding, and bioactivity. They are widely used in virtual screening (VS) to classify compounds as active or inactive, aiding in compound prioritization.
These classifiers can also predict toxicity mechanisms and support the design of safer drug candidates. By modeling the probability of activity from molecular descriptors, they are useful for ranking large compound databases. Advanced methods like Bayesian networks and model averaging further enhance prediction accuracy, making Naïve Bayesian classifiers a valuable tool in drug discovery for tasks such as activity prediction and multitarget classification 48.
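How a Naïve Bayesian classifier scores a compound can be sketched with a Bernoulli model over binary fingerprint bits. The four-bit fingerprints, the labels (1 = active, 0 = inactive), the Laplace smoothing constant `alpha`, and the function names are assumptions made for illustration.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log P(class) and smoothed P(bit = 1 | class) for every bit."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        p_on = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(len(X[0]))
        ]
        model[c] = (math.log(len(rows) / len(X)), p_on)
    return model


def log_scores(model, x):
    """log P(class) + sum of log P(bit | class), treating bits as independent."""
    scores = {}
    for c, (log_prior, p_on) in model.items():
        scores[c] = log_prior + sum(
            math.log(p if bit else 1.0 - p) for bit, p in zip(x, p_on)
        )
    return scores


def predict(model, x):
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


# Hypothetical fingerprints: actives tend to set bits 0-1, inactives bits 2-3.
X = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_bernoulli_nb(X, y)
print(predict(model, [1, 1, 0, 0]), predict(model, [0, 0, 1, 1]))
```

The per-class log scores can also be used directly to rank a compound database, since a larger score for the active class indicates a higher modeled probability of activity.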
Ensemble Methods: Ensemble methods are machine learning techniques that combine multiple models to improve prediction accuracy and robustness. By aggregating the outputs of various models, such as decision trees in Random Forest (RF), ensemble methods reduce overfitting and enhance generalization.
In drug discovery, these methods improve virtual screening, QSAR modeling, and protein-ligand interaction predictions. They are particularly valuable for handling noisy or imbalanced data, performing feature selection, and identifying key molecular descriptors that influence biological activity. Ensemble methods also facilitate multitarget drug discovery by predicting compound interactions with multiple targets, making them essential tools for optimizing lead compounds and improving drug discovery efficiency 49.
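Aggregating many weak models can be sketched with bagging: each one-split tree (a "stump") is trained on a bootstrap resample and the ensemble takes a majority vote. This is a simplified stand-in for a Random Forest, which additionally randomizes the features considered at each split. The single descriptor, its values, the labels, and the function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(xs, ys):
    """One-split tree on a single descriptor: pick the threshold with fewest errors."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (len(ys) + 1, None, majority, majority)  # (errors, threshold, left, right)
    for t in sorted(set(xs)):
        left = [lab for x, lab in zip(xs, ys) if x <= t]
        right = [lab for x, lab in zip(xs, ys) if x > t]
        if not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return best[1:]


def stump_predict(stump, x):
    threshold, l_lab, r_lab = stump
    if threshold is None:          # resample had no usable split: constant prediction
        return l_lab
    return l_lab if x <= threshold else r_lab


def train_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def ensemble_predict(models, x):
    """Majority vote over all stumps."""
    return Counter(stump_predict(m, x) for m in models).most_common(1)[0][0]


# Hypothetical descriptor values with activity labels.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 7.0, 7.5, 8.0, 8.5, 9.0]
ys = ["inactive"] * 5 + ["active"] * 5
models = train_bagged_stumps(xs, ys)
print(ensemble_predict(models, 0.5), ensemble_predict(models, 9.5))
```

Individual stumps place their thresholds differently because each sees a different resample; the vote averages out that variability, which is the mechanism behind the reduced overfitting described above.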
k-Nearest Neighbors: The k-Nearest Neighbors (k-NN) algorithm is a simple, instance-based machine learning method used for classification, regression, and ranking tasks. It predicts the class or property of a molecule by comparing it to its k nearest neighbors in a feature space, using distance metrics like Euclidean or Manhattan. In drug discovery, k-NN is widely used for predicting biological activity, screening compounds, and profiling pharmacological properties by classifying molecules based on their similarity to known compounds. It can also predict drug-drug interactions and is effective in QSAR modeling and protein function prediction. However, k-NN is sensitive to noisy data and irrelevant features, and its performance depends on choosing the right distance metric and the value of k. Despite these limitations, it is a valuable tool for early-stage drug discovery, especially in compound screening and activity prediction 48.
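The neighbor-voting procedure can be sketched in a few lines of plain Python with both distance metrics mentioned above. The two-descriptor feature vectors, the activity labels, and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, training, k=3, metric="euclidean"):
    """Classify `query` by majority vote among its k nearest training molecules."""
    def distance(a, b):
        if metric == "manhattan":
            return sum(abs(x - y) for x, y in zip(a, b))
        return math.dist(a, b)                  # Euclidean distance

    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: (descriptor vector, known label).
training = [
    ((2.0, 3.0), "active"),
    ((2.5, 3.2), "active"),
    ((1.8, 2.7), "active"),
    ((5.0, 6.0), "inactive"),
    ((5.5, 6.4), "inactive"),
    ((4.8, 5.9), "inactive"),
]
print(knn_predict((2.2, 3.1), training, k=3))
print(knn_predict((5.1, 6.1), training, k=3, metric="manhattan"))
```

There is no training step: the model is the stored data itself, so prediction cost grows with the size of the compound set, and both `k` and `metric` have to be chosen by validation.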
Artificial Neural Networks: Artificial Neural Networks (ANNs) are machine learning models inspired by the human brain, consisting of interconnected neurons organized into layers.
These networks can model complex, nonlinear relationships in data and are trained using supervised or unsupervised learning. ANNs have a broad range of applications in drug discovery, including predicting compound activity, performing QSAR modeling, identifying drug targets, discovering biomarkers, repurposing drugs, and optimizing drug design. They are particularly valuable for analyzing large, complex datasets, but their "black box" nature can make their decision-making process difficult to interpret. Despite this, their ability to capture intricate patterns makes them a powerful tool in advancing drug development 50.
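Why layered neurons can capture nonlinear relationships can be sketched with a forward pass through a tiny two-layer network. The weights below are set by hand rather than learned, and the task (output 1 when exactly one of two hypothetical substructure flags is present) is an invented example of a pattern that no single linear unit can represent.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hand-set parameters (assumptions for illustration, not trained values).
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]


def forward(x):
    """Input layer -> hidden layer with ReLU -> single linear output neuron."""
    hidden = relu(dense(x, HIDDEN_W, HIDDEN_B))
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]


for flags in ([0, 0], [1, 0], [0, 1], [1, 1]):
    print(flags, forward(flags))
```

In practice the weights are fitted to data, typically by backpropagation, and real networks have far more neurons; the hidden layer's many interacting weights are also the source of the "black box" character noted above.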
Advanced Machine Learning Techniques in Drug Discovery: Traditional Multi-Task Learning (MTL) methods face challenges like the need for large datasets and significant human intervention. To overcome these limitations, advanced techniques such as Reinforcement Learning (RL), Transfer Learning, and enhanced MTL have been developed. RL fosters more autonomous learning, Transfer Learning enables knowledge transfer between tasks with limited data, and MTL allows for predictive modeling even with smaller datasets. These techniques are particularly valuable in fields like drug discovery, where data scarcity is a common issue 51.
- Reinforcement Methods
- Transfer Learning
- Multitask Learning
- Active Learning
- Generative Models
- Bayesian Neural Network
Reinforcement Methods: Reinforcement Learning (RL) is a form of machine learning in which an agent learns to make decisions by interacting with its environment and receiving feedback in the form of rewards or penalties. In contrast to supervised learning, which relies on labeled data for training, RL involves continually learning and improving actions based on cumulative rewards. RL comprises the agent, environment, states, actions, policy, and reward function as its essential elements 52. RL has been utilized in multiple fields like gaming, robotics, and finance, and is now receiving interest in drug discovery. In drug discovery, RL is applied to tasks like de novo drug design, where it is combined with generative models to produce new molecules with desired properties. For instance, RL has been used to optimize molecular structures, accelerate drug development, and improve drug specificity. RL models have successfully produced biologically feasible compounds by generating molecules with specific objectives, like improving biological activity or chemical complexity. RL has also found use in personalized healthcare, specifically in real-time treatment decisions for conditions such as sepsis or in anaesthesia, as well as in medical imaging and disease identification. Nevertheless, further improvement is needed to ensure that RL generates diverse, drug-like compounds rather than molecules that closely resemble existing medications 53.
Transfer Learning: Transfer learning applies knowledge gained on one task to improve learning on a related task, especially when data for the new task are scarce. A model is first trained on a substantial dataset for a source task, and the acquired knowledge is then transferred to a target task with less data. This approach is most useful when the source task is data-rich and the target task is data-poor, giving faster development and better performance than training from scratch on a small dataset 54. In drug discovery, transfer learning helps to address limited data availability and speeds up different phases of development. In drug sensitivity prediction, models trained on gene expression data from one cancer type are used to predict drug responses in other cancer types 55. It is also used to predict adverse drug reactions (ADRs) by transferring information from extensive drug-related text datasets to recognize safety issues in novel drugs 56. Moreover, transfer learning supports de novo drug design by using existing chemical databases to generate new drug-like molecules, with the aim of improving the chances of success in clinical trials 57. Transfer learning is also used in predicting biological processes and in materials science, allowing more precise predictions with limited data. Overall, transfer learning is a strong asset in drug discovery, speeding up model building and improving the precision and effectiveness of drug design, safety testing, and efficacy assessment.
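The pre-train then fine-tune pattern described above can be sketched with a small linear model. The sketch assumes a synthetic, noise-free source task with 200 samples and a related target task with only 5 samples that shares the same slopes but has a different offset. The data generator `make_data`, the trainer `fit`, and all numeric values are hypothetical and serve only to show the mechanism.

```python
import random


def make_data(n, offset, seed):
    """Noise-free toy data: y = 2*x1 - 1*x2 + offset (a stand-in for assay data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(((x1, x2), 2.0 * x1 - 1.0 * x2 + offset))
    return rows


def fit(data, w, b, lr=0.1, epochs=500, freeze_w=False):
    """Full-batch gradient descent on mean squared error for y = w.x + b."""
    w = list(w)
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in data:
            err = w[0] * x1 + w[1] * x2 + b - y
            grad_w[0] += 2 * err * x1 / n
            grad_w[1] += 2 * err * x2 / n
            grad_b += 2 * err / n
        if not freeze_w:
            w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


# 1) Pre-train on a large "source" task.
source = make_data(n=200, offset=0.0, seed=1)
w_src, b_src = fit(source, w=[0.0, 0.0], b=0.0)

# 2) Fine-tune on a small related "target" task:
#    reuse the learned weights and re-learn only the offset.
target = make_data(n=5, offset=3.0, seed=2)
w_ft, b_ft = fit(target, w=w_src, b=b_src, freeze_w=True)
```

Freezing the transferred weights and updating only one parameter mirrors the common practice of freezing early layers of a pre-trained network and retraining the final layer on the small dataset.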
Multitask Learning: Multitask Learning (MTL) is a machine learning technique in which one model is trained on multiple tasks at once, enabling it to share information among related tasks. Instead of creating an individual model for each task, MTL learns common internal representations that all tasks share, alongside layers dedicated to each task. This technique works well with sparse or noisy data because it enables knowledge transfer, effectively enlarges the training data, and reduces overfitting. MTL can be used in supervised and unsupervised learning, with algorithms such as neural networks, k-nearest neighbors (kNN), support vector machines (SVM), and Bayesian regression 58.
MTL is especially beneficial in drug discovery for designing multitarget drugs that act on multiple sites at once. Although such drugs may cause adverse effects because of their broad range of action, they have also proved more successful in treating complex conditions such as cancer and metabolic disorders. MTL has been used to identify the multiple targets affected by a single compound, improving the understanding of how drugs interact with biological pathways. Li et al. applied MTL to discover significant multitarget interactions using unsupervised machine learning, gene expression data, and compound structure data 59. MTL is also beneficial for drug screening: platforms like Macau allow drug properties and their effects on cell lines to be analyzed simultaneously, giving more precise predictions of drug behavior and mechanisms of action 60. Moreover, integrating MTL with gradient-boosting decision trees has shown improved performance on limited datasets, specifically in drug-target interaction analysis 61. MTL has also been used in sentiment analysis of drug reviews, aiding the understanding of public perceptions of drug effectiveness and potential side effects 62. By learning multiple tasks simultaneously, MTL improves model generalization and prediction, offering a more efficient approach to drug development.
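The idea of shared parameters combined with task-specific parameters can be sketched as follows. This is a deliberately small linear stand-in for a multitask network: one weight vector is shared by two toy endpoints, and each endpoint keeps its own offset. The task names, the generator `make_task`, and the trainer `train_multitask` are hypothetical, and the data are synthetic and noise-free.

```python
import random


def make_task(n, offset, seed):
    """Toy endpoint sharing the slopes (1.5, -0.5); only the offset differs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        rows.append((x, 1.5 * x[0] - 0.5 * x[1] + offset))
    return rows


def train_multitask(tasks, lr=0.05, epochs=3000):
    """Jointly fit y_t = w_shared . x + b_t on the sum of per-task mean squared errors."""
    w = [0.0, 0.0]                      # shared across all tasks
    b = {name: 0.0 for name in tasks}   # one task-specific parameter per task
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = {name: 0.0 for name in tasks}
        for name, rows in tasks.items():
            for x, y in rows:
                err = w[0] * x[0] + w[1] * x[1] + b[name] - y
                # Every task contributes gradient to the shared weights...
                grad_w[0] += 2 * err * x[0] / len(rows)
                grad_w[1] += 2 * err * x[1] / len(rows)
                # ...but only to its own task-specific offset.
                grad_b[name] += 2 * err / len(rows)
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        for name in tasks:
            b[name] -= lr * grad_b[name]
    return w, b


tasks = {
    "data_rich_endpoint": make_task(n=40, offset=1.0, seed=3),
    "data_poor_endpoint": make_task(n=3, offset=-2.0, seed=4),
}
w_shared, offsets = train_multitask(tasks)
```

Because the shared weights are estimated mostly from the 40-sample task, the 3-sample task effectively only has to learn its own offset, which is the sense in which related tasks lend each other data.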
Active Learning: Active learning is a machine learning approach that addresses the scarcity of labeled data, which may be expensive or time-consuming to acquire. In passive learning, a model is trained on a fixed set of labeled data; in active learning, the model instead selects and requests labels for the most informative data points from a large pool of unlabeled data. The model asks for labels only for samples it is unsure about, enabling it to learn more effectively with less labeled data. This is especially beneficial where labeling is costly, as it maximizes the value obtained from each label. In drug discovery, active learning speeds up the process and cuts expenses. For example, researchers can predict how drugs cross the blood-brain barrier by labeling a small percentage (e.g., 10%) of molecules and using the resulting model to predict the other 90% 63.
The model highlights the samples with the highest uncertainty, guiding researchers to concentrate their experimental work on them and ultimately reducing the number of experiments required. This strategy accelerates model building and minimizes the resources needed. Active learning has been used effectively in drug discovery to predict small-molecule bioactivity, ligand-target interactions, and toxicity. By combining active learning with methods such as Support Vector Machines (SVM) or deep learning, scientists can improve prediction accuracy using less experimental data, thereby simplifying drug development and reducing its cost 64.
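The query loop described above can be sketched with pool-based uncertainty sampling on a toy problem. The sketch assumes 100 molecules described by a single hypothetical descriptor, a hidden cutoff that decides the true label, and an `oracle` function that stands in for the costly experiment. The model is a simple threshold classifier, chosen only to keep the example short.

```python
POOL = [i / 100 for i in range(100)]   # hypothetical 1-D descriptor for 100 molecules
TRUE_CUTOFF = 0.37                      # hidden from the model; used only by the oracle


def oracle(x):
    """Stands in for an expensive experiment that returns the true label."""
    return 1 if x >= TRUE_CUTOFF else 0


def boundary(labeled):
    """Model: a threshold halfway between the highest negative and lowest positive."""
    highest_neg = max(x for x, y in labeled.items() if y == 0)
    lowest_pos = min(x for x, y in labeled.items() if y == 1)
    return (highest_neg + lowest_pos) / 2


def active_learning(n_queries=8):
    # Seed set: one known negative and one known positive.
    labeled = {POOL[0]: oracle(POOL[0]), POOL[-1]: oracle(POOL[-1])}
    for _ in range(n_queries):
        cut = boundary(labeled)
        unlabeled = [x for x in POOL if x not in labeled]
        # Uncertainty sampling: the point nearest the decision boundary
        # is the one the model is least sure about.
        query = min(unlabeled, key=lambda x: abs(x - cut))
        labeled[query] = oracle(query)
    return labeled


labeled = active_learning()
cut = boundary(labeled)
accuracy = sum((x >= cut) == bool(oracle(x)) for x in POOL) / len(POOL)
```

In this toy setting, 10 labels (2 seed points plus 8 queries) are enough to classify the whole pool, which mirrors the 10% labeling example in the text. Real bioactivity data are noisy and high-dimensional, so the uncertainty estimate would normally come from the model itself, for instance from class probabilities or an ensemble.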
Generative Models: Generative models are machine learning methods that produce new data examples by learning the underlying distribution of the input data. In contrast to discriminative models, which predict labels for given instances, generative models describe the data generation process probabilistically. This enables them to generate new instances from the same distribution as the training data without needing explicit rules or labels 65. In drug discovery, generative models are especially useful for tasks such as de novo drug design, as they can create new molecular structures with specified characteristics. This allows unexplored regions of chemical space to be searched for drug candidates, particularly where data are sparse. Generative models can also be used for data augmentation and dimensionality reduction, helping to simplify complex datasets and make analysis more efficient. Nevertheless, the novelty of any newly generated compound must be verified thoroughly so that it does not duplicate existing drugs or compounds in the training set 66.
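Both ideas above, learning a distribution to sample from and checking the novelty of what is sampled, can be sketched with a very small character-level model. The sketch assumes a hypothetical training set of six simple SMILES strings and uses a bigram (first-order Markov) model rather than a deep network. It does not check chemical validity, so sampled strings may be malformed; a real pipeline would add a validity filter.

```python
import random
from collections import defaultdict

# Hypothetical training set of simple SMILES strings (illustration only).
TRAINING = ["CCO", "CCN", "CCCO", "CC(=O)O", "CCOC", "CNC"]
START, END = "^", "$"


def fit_bigram(strings):
    """Count character transitions, i.e. estimate P(next char | current char)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        chars = [START] + list(s) + [END]
        for cur, nxt in zip(chars, chars[1:]):
            counts[cur][nxt] += 1
    return counts


def sample(counts, rng, max_len=12):
    """Draw one string from the learned distribution, character by character."""
    out, cur = [], START
    while len(out) < max_len:
        options = list(counts[cur].keys())
        weights = list(counts[cur].values())
        cur = rng.choices(options, weights=weights)[0]
        if cur == END:
            break
        out.append(cur)
    return "".join(out)


def novel_samples(n=200, seed=0):
    """Sample n strings and keep only those absent from the training set."""
    rng = random.Random(seed)
    counts = fit_bigram(TRAINING)
    drawn = {sample(counts, rng) for _ in range(n)}
    # Novelty check: drop anything that merely reproduces a training string.
    return sorted(s for s in drawn if s and s not in TRAINING)
```

The novelty filter in `novel_samples` corresponds to the verification step mentioned above: anything that simply reproduces an input compound is discarded before further evaluation.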
Bayesian Neural Network: Bayesian Neural Networks (BNNs) are machine learning models that use Bayesian inference to combine the predictions of many neural networks, which makes them especially useful for limited datasets. Conventional neural networks are prone to overfitting; BNNs address this by placing prior probability distributions over the model parameters, leading to more reliable predictions 67. BNNs are therefore valuable in drug discovery, where data are frequently limited. Recent research has demonstrated their ability to predict molecular behavior, detect genes associated with drug responses, and assess drug similarity. Despite these benefits, BNNs require a more intricate design and expertise in choosing and updating prior distributions. Even so, their capacity to handle uncertainty and small datasets makes them a valuable tool in areas with scarce labeled data 68.
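The role of the prior and of model averaging can be sketched in the simplest possible case. The example below is not a neural network: it is a Bayesian linear model with a single weight, for which the posterior has a closed form. In a BNN the same idea is applied to every weight, usually with approximate inference. The measurements, the prior variance, and the noise variance are assumptions chosen for illustration.

```python
import random


def posterior(xs, ys, prior_var=1.0, noise_var=0.25):
    """Closed-form Gaussian posterior over the weight w in y = w*x + noise,
    starting from the prior w ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    var = 1.0 / precision
    mean = var * sum(x * y for x, y in zip(xs, ys)) / noise_var
    return mean, var


def predict(x_new, mean, var, n_models=2000, seed=0):
    """Average the predictions of many sampled models (Monte Carlo averaging)."""
    rng = random.Random(seed)
    preds = [rng.gauss(mean, var ** 0.5) * x_new for _ in range(n_models)]
    avg = sum(preds) / n_models
    spread = (sum((p - avg) ** 2 for p in preds) / n_models) ** 0.5
    return avg, spread


xs = [0.5, 1.0, 1.5]
ys = [1.2, 1.9, 3.1]            # hypothetical measurements, roughly y = 2x
mean_small, var_small = posterior(xs, ys)
mean_large, var_large = posterior(xs * 10, ys * 10)   # same pattern, 10x more data
least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

With three data points the prior pulls the estimate toward zero relative to the least-squares fit and the posterior variance is comparatively large; with more data both effects shrink. The `spread` returned by `predict` is the kind of uncertainty estimate that makes Bayesian models attractive for small datasets.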
Open Problems: This section outlines the main obstacles and unresolved issues in using machine learning (ML) for drug discovery, especially as the industry progresses from preclinical to clinical phases 69. These obstacles underscore the need for innovative ML solutions that improve predictability, interpretability, and effectiveness. Key challenges include predicting the effects of drug combinations, especially in illnesses such as cancer where multiple medications may be necessary; weighing toxicity against efficacy in treatment predictions; and validating predictions against real-world clinical data 70.
Furthermore, many ML models lack transparency and trustworthiness because of their "black box" nature 71. Further challenges include making precise predictions of drug-target interactions (DTIs) and drug response across different patient groups, and understanding the complex molecular mechanisms behind drug-drug interactions (DDIs). To tackle these issues, techniques like explainable AI (XAI) and federated learning have been proposed to improve model interpretability and to allow patient-specific data to be used while preserving privacy. Another area requiring attention is the absence of standardized databases and accessible platforms for developing XAI models. Ultimately, the aim is to create ML models that are not only accurate but also transparent and understandable, so that they can be applied effectively in clinical drug development and improve success rates 72.
Future Challenges: The future challenges in machine learning for AI-driven digital twinning in pharmaceutical drug discovery encompass a range of critical areas that must be addressed to unlock the full potential of these technologies. One of the primary challenges is ensuring high-quality, diverse, and comprehensive datasets, as the accuracy of AI models heavily depends on the data used for training 73. Incomplete, noisy, or biased data, especially when integrating clinical, genomic, and environmental factors, can undermine the effectiveness of digital twins. Handling missing data, reducing noise, and integrating multi-source datasets will be crucial for building robust models that accurately reflect real-world clinical variability 74. Additionally, preventing overfitting and ensuring that models generalize well to unseen data remains a significant hurdle. This requires rigorous training methods such as cross-validation, regularization, and transfer learning, so that digital twins can adapt to diverse patient populations and clinical scenarios. Furthermore, as deep learning models often operate as "black boxes," enhancing model transparency and interpretability through techniques like explainable AI (XAI) is essential for gaining clinician trust and ensuring regulatory acceptance. Personalization of digital twins also poses a challenge, as drug responses vary across individuals. Machine learning techniques such as reinforcement learning and transfer learning will be necessary to create individualized simulations that reflect the genetic, lifestyle, and environmental differences among patients. To stay relevant, digital twins must be continuously updated with real-time patient data, which requires dynamic models capable of incremental learning.
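Cross-validation, mentioned above as a guard against overfitting, can be sketched as a splitting routine. The helper `k_fold_indices` is a hypothetical name and performs no shuffling, so it assumes the samples are already in random order.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size
```

A model is trained once per fold on `train` and scored on `val`; averaging the k scores gives an estimate of how well it generalizes to data it has not seen, because every sample is used for validation exactly once.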
Handling multi-modal and multi-scale data from diverse sources, such as genomics, proteomics, and clinical imaging, will require sophisticated machine learning algorithms that can process and integrate these complex datasets into a unified model. Ethical and privacy concerns, particularly regarding the use of sensitive patient data, must be addressed through privacy-preserving methods like federated learning and differential privacy to ensure both security and regulatory compliance. Moreover, the scalability of these models to simulate large patient populations requires high-performance computing resources and efficient algorithms. Finally, cost and resource constraints associated with data collection, model development, and computational infrastructure will need to be mitigated through strategies like transfer learning and the use of synthetic data to reduce expenses and increase accessibility. Addressing these challenges will be pivotal in ensuring that AI-driven digital twins can transform pharmaceutical drug discovery by providing more accurate, personalized, and efficient models for drug development and patient care 75.
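The federated learning approach mentioned above can be illustrated by its aggregation step. In this sketch each site trains locally and shares only a weight vector and its sample count; a coordinator then forms a sample-weighted average. The function name, the three sites, and their weight vectors are hypothetical, and the sketch does not include differential privacy or secure communication.

```python
def federated_average(client_updates):
    """Combine client models as a weighted mean, with weights = local sample counts.

    client_updates: list of (weights, n_samples) pairs, one per site.
    Only model parameters are shared; the raw patient records stay local.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [sum(w[i] * n for w, n in client_updates) / total for i in range(dim)]


# Hypothetical weight vectors from three hospitals after one round of local training.
updates = [([0.2, 1.0], 100), ([0.4, 0.8], 300), ([0.0, 1.2], 100)]
global_weights = federated_average(updates)
```

The averaged weights are sent back to the sites for the next round of local training, so the shared model improves without patient-level data leaving any institution.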
CONCLUSION: Machine learning technologies have already made significant contributions to the pharmaceutical industry and healthcare, but overcoming the challenges outlined above will be crucial to unlocking their full potential. Addressing issues related to data quality, model interpretability, transfer learning, and optimization will be key to advancing the role of ML in drug development and improving patient care outcomes. Making these technologies more affordable for developing countries is a further challenge to overcome.
ACKNOWLEDGEMENT: I would like to show my sincere gratitude and respect towards IPS Academy College of Pharmacy, Indore for providing me with the platform to showcase my knowledge and helping me throughout my work. Special thanks to the Centre for Drug Evaluation and Research (CDER) for providing the necessary resources, data and support that were essential to this work.
CONFLICT OF INTEREST: The authors have no conflict of interest.
REFERENCES:
- Alzubi J, Nayyar A and Kumar A: Machine learning from theory to algorithms: an overview. In Journal of Physics: Conference Series 2018; 1142: 012012. IOP Publishing.
- Fradkov AL: Early history of machine learning. IFAC-PapersOnLine 2020; 53(2): 1385-1390.
- Artificial Intelligence Medical Device Working Group. Machine Learning-enabled Medical Devices: Key Terms and Definitions. In International Medical Device Regulators Forum (IMDRF), IMDRF/AIMD WGN 2022; 67: 2022-05.
- Fleck J: Development and establishment in artificial intelligence. In The Question of Artificial Intelligence 2018; 106-164.
- Joshi AV: Perceptron and neural networks. In Machine learning and artificial intelligence Cham: Springer International Publishing 2022; 57-72.
- Garvey SC: The “general problem solver” does not exist: Mortimer Taube and the art of AI criticism. IEEE Annals of the History of Computing 2021; 43(1): 60-73.
- Qureshi FK: The Evolution of AI Algorithms: From Rule-Based Systems to Deep Learning. Frontiers in Artificial Intelligence Research 2024; 1(02): 250-88.
- Pranav KV and Sarma KJ: Origin, Development and Uses of Machine Learning. International Journal for Multidisciplinary Research 2023; 5(1): 1-8.
- Sejnowski TJ: Large language models and the reverse turing test. Neural Computation 2023; 35(3): 309-42.
- Aggarwal CC: Neural networks and deep learning. Cham Springer 2018; 1-6.
- Kamath U, Liu J and Whitaker J: Deep learning for NLP and speech recognition. Cham, Switzerland: Springer; 2019; 39-80.
- Whittemore T: Beyond the black box: the case for a prize system to encourage artificial intelligence innovation. Michigan State Law Review 2024; 2024(1).
- Bhatt D, Patel C, Talsania H, Patel J, Vaghela R, Pandya S, Modi K and Ghayvat H: CNN variants for computer vision: History, architecture, application, challenges and future scope. Electronics 2021; 10(20): 2470.
- Shafiq M and Gu Z: Deep residual learning for image recognition: A survey. Applied Sciences 2022; 12(18): 8972.
- Baker JJ: A legal research odyssey: artificial intelligence as disruptor. Law Libr J 2018; 110: 5.
- Krizhevsky A, Sutskever I and Hinton GE: ImageNet classification with deep convolutional neural networks. Communications of the ACM 2017; 60(6): 84-90.
- Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M and Dieleman S: Mastering the game of Go with deep neural networks and tree search. Nature 2016; 529(7587): 484-489.
- Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A and Chen Y: Mastering the game of Go without human knowledge. Nature 2017; 550(7676): 354-359.
- Von Luxburg U, Guyon I, Bengio S, Wallach H, Fergus R, Vishwanathan SV and Garnett R: Advances in neural information processing systems 30. In 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California, USA 2017; 4-9.
- Devlin J: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018; 1-16.
- Vinyals O, Babuschkin I, Czarnecki WM, Mathieu M, Dudzik A, Chung J, Choi DH, Powell R, Ewalds T, Georgiev P and Oh J: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019; 575(7782): 350-354.
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A and Bridgland A: Applying and improving AlphaFold at CASP14. Proteins: Structure, Function, and Bioinformatics 2021; 89(12): 1711-21.
- Available from: Https://www.fda.gov/science-research/science-and-research-special-topics/artificial-intelligence-and-machine-learning-aiml?drug-development. Accessed 2023.
- Patel V and Shah M: Artificial intelligence and machine learning in drug discovery and development. Intelligent Medicine 2022; 2(3): 134-40.
- Dara S, Dhamercherla S, Jadav SS, Babu CM and Ahsan MJ: Machine learning in drug discovery: a review. Artificial intelligence review 2022; 55(3): 1947-99.
- Rahaman MJ: A Comprehensive Review to Understand the Definitions, Advantages, Disadvantages and Applications of Machine Learning Algorithms. International J of Computer Applicati 2024; 0975: 8887.
- Ren J, Shen W, Man Y and Dong L: editors. Applications of artificial intelligence in process systems engineering. Elsevier 2021; 5.
- Khanzode KC and Sarode RD: Advantages and disadvantages of artificial intelligence and machine learning: A literature review. International Journal of Library & Information Science (IJLIS) 2020; (1): 3.
- Sarker IH: Machine learning: Algorithms, real-world applications and research directions. SN Computer Science 2021; 2(3): 160.
- Pugliese R, Regondi S and Marini R: Machine learning-based approach: Global trends, research directions, and regulatory standpoints. Data Sci and Mana 2021; 4: 19-29.
- Taye MM: Understanding of machine learning with deep learning: architectures, workflow, applications and future directions. Computers 2023; 12(5): 91.
- Reddy YC, Viswanath P and Reddy BE: Semi-supervised learning: A brief review. Int J Eng Technol 2018; 7(1-8): 81.
- Fayaz SA, Jahangeer Sidiq S, Zaman M and Butt MA: Machine learning: An introduction to reinforcement learning. Machine Learning and Data Science: Fundamentals and Applications 2022; 1-22.
- Haider R: Drug Discovery in the 21st Century. Progress Annals: Journal of Progressive Research 2023; 1(1): 33-50.
- Moffat JG, Vincent F, Lee JA, Eder J and Prunotto M: Opportunities and challenges in phenotypic drug discovery: an industry perspective. Nature reviews Drug Discovery 2017; 16(8): 531-43.
- Schlander M, Hernandez-Villafuerte K, Cheng CY, Mestre-Ferrandiz J and Baumann M: How much does it cost to research and develop a new drug? A systematic review and assessment. Pharmacoeconomics 2021; 1243-69.
- Carracedo-Reboredo P, Liñares-Blanco J, Rodríguez-Fernández N, Cedrón F, Novoa FJ, Carballal A, Maojo V, Pazos A and Fernandez-Lozano C: A review on machine learning approaches and trends in drug discovery. Computational and Structural Biotechnology Journal 2021; 19: 4538-58.
- Tabana Y, Babu D, Fahlman R, Siraki AG and Barakat K: Target identification of small molecules: an overview of the current applications in drug discovery. BMC Biotechnology 2023; 23(1): 44.
- Arya H and Coumar MS: Target identification and validation. The Design and Development of Novel Drugs and Vaccines: Principles and Protocols 2021; 11.
- Pillai N, Dasgupta A, Sudsakorn S, Fretland J and Mavroudis PD: Machine learning guided early drug discovery of small molecules. Drug Discovery Today 2022; 27(8): 2209-15.
- Deore AB, Dhumane JR, Wagh R and Sonawane R: The stages of drug discovery and development process. Asian Journal of Pharmaceutical Research and Development 2019; 7(6): 62-7.
- Murahari M, Nalluri BN and Chakravarthi G: Current Trends in Drug Discovery, Development and Delivery (CTD4-2022). Royal Society of Chemistry Editors 2023.
- Honek J: Preclinical research in drug development. Medical Writing 2017; 26: 5-8.
- Kandi V and Vadakedath S: Clinical trials and clinical research: a comprehensive review. Cureus 2023; 15(2).
- Lavecchia A: Machine-learning approaches in drug discovery: methods and applications. Drug Discovery Today 2015; 20(3): 318-31.
- Pisner DA and Schnyer DM: Support vector machine. In Machine Learning 2020; 101-121.
- Dehghani AA, Movahedi N, Ghorbani K and Eslamian S: Decision tree algorithms. In Handbook of Hydroinformatics 2023; 171-187.
- Itoo F, Meenakshi and Singh S: Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection. International Journal of Information Technology 2021; 13(4): 1503-11.
- Zhang Y, Liu J and Shen W: A review of ensemble learning algorithms used in remote sensing applications. Applied Sciences 2022; 12(17): 8654.
- Goel A, Goel AK and Kumar A: The role of artificial neural network and machine learning in utilizing spatial information. Spatial Information Research 2023; 31(3): 275-85.
- Elbadawi M, Gaisford S and Basit AW: Advanced machine-learning techniques in drug discovery. Drug Discovery Today 2021; 26(3): 769-77.
- Polydoros AS and Nalpantidis L: Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems 2017; 86(2): 153-73.
- Ståhl N, Falkman G, Karlsson A, Mathiason G and Bostrom J: Deep reinforcement learning for multiparameter optimization in de novo drug design. Journal of Chemical Information and Modeling 2019; 59(7): 3166-76.
- Cheplygina V, De Bruijne M and Pluim JP: Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical Image Analysis 2019; 54: 280-96.
- Dhruba SR, Rahman R, Matlock K, Ghosh S and Pal R: Application of transfer learning for cancer drug sensitivity prediction. BMC Bioinformatics 2018; 19: 51-63.
- Li Z, Yang Z, Luo L, Xiang Y and Lin H: Exploiting adversarial transfer learning for adverse drug reaction detection from texts. J of Biomed Inf 2020; 106: 103431.
- Yasonik J: Multiobjective de novo drug design with recurrent neural networks and nondominated sorting. Journal of Cheminformatics 2020; 12(1): 14.
- Wu K and Wei GW: Quantitative toxicity prediction using topology based multitask deep neural networks. Journal of Chemical Information and Modeling 2018; 58(2): 520-31.
- Li L, He X and Borgwardt K: Multi-target drug repositioning by bipartite block-wise sparse multi-task learning. BMC Systems Biology 2018; 12: 85-97.
- Yang M, Simm J, Lam CC, Zakeri P, van Westen GJ, Moreau Y and Saez-Rodriguez J: Linking drug target and pathway activation for effective therapy using multi-task learning. Scientific Reports 2018; 8(1): 8322.
- Weng Y, Lin C, Zeng X and Liang Y: Drug target interaction prediction using multi-task learning and co-attention. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2019; 528-533.
- Han Y, Liu M and Jing W: Aspect-level drug reviews sentiment analysis based on double BiGRU and knowledge transfer. IEEE Access 2020; 8: 21314-25.
- Chen CT and Gu GX: Generative deep neural networks for inverse materials design using backpropagation and active learning. Advanced Science 2020; 7(5): 1902607.
- Reker D: Practical considerations for active machine learning in drug discovery. Drug Discovery Today: Technologies 2019; 32: 73-9.
- Ding J, Condon A and Shah SP: Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications 2018; 9(1): 2002.
- Arús-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond JL, Chen H and Engkvist O: SMILES-based deep generative scaffold decorator for de-novo drug design. Journal of Cheminformatics 2020; 12: 1-8.
- Cai R, Ren A, Liu N, Ding C, Wang L, Qian X, Pedram M and Wang Y: VIBNN: Hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices 2018; 53(2): 476-88.
- Shridhar K, Laumann F and Liwicki M: A comprehensive guide to bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731. 2019; 1-38.
- Julkunen H, Cichonska A, Gautam P, Szedmak S, Douat J, Pahikkala T, Aittokallio T and Rousu J: Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nature Communications 2020; 11(1): 6136.
- Attwood MM, Fabbro D, Sokolov AV, Knapp S and Schiöth HB: Trends in kinase drug discovery: targets, indications and inhibitor design. Nature Reviews Drug Discovery 2021; 20(11): 839-61.
- Jiménez-Luna J, Grisoni F and Schneider G: Drug discovery with explainable artificial intelligence. Nature Machine Intelligence 2020; 2(10): 573-84.
- Askr H, Elgeldawi E, Aboul Ella H, Elshaier YA, Gomaa MM and Hassanien AE: Deep learning in drug discovery: an integrative review and future challenges. Artificial Intelligence Review 2023; 56(7): 5975-6037.
- O’Connor TF, Yu LX and Lee SL: Emerging technology: A key enabler for modernizing pharmaceutical manufacturing and advancing product quality. International Journal of Pharmaceutics 2016; 509(1-2): 492-8.
- Bedi P, Sharma C, Vashisth P, Goel D and Dhanda M: Handling cold start problem in Recommender Systems by using Interaction Based Social Proximity factor. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) 2015; 1987-1993.
- Aittokallio T: What are the current challenges for machine learning in drug discovery and repurposing? Expert Opinion on Drug Discovery 2022; 17(5): 423-5.
How to cite this article:
Verma M, Hardenia S and Jain DK: Machine learning strategies for drug discovery and development. Int J Pharm Sci & Res 2025; 16(5): 1194-08. doi: 10.13040/IJPSR.0975-8232.16(5).1194-08.
All © 2025 are reserved by International Journal of Pharmaceutical Sciences and Research. This Journal licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Article Information: Pages 1194-1208; language: English. Correspondence: shivsharma280485@gmail.com. Received 18 November 2024; revised 18 December 2024; accepted 22 December 2024; published 01 May 2025.