It takes too long for machine learning models to diffuse to diseases of the poor
Caleb K. Kibet and Geoffrey H. Siwo
The history of data analysis can be traced back to ancient Egypt, where periodic censuses for building pyramids gave birth to statistics. Fast forward to the last century: the invention of digital technologies and the subsequent exponential growth in computing power have accelerated the pace at which we can draw insights from data. In addition, the rapid growth in both data and computing speed has enabled the development of more sophisticated mathematical, statistical, machine learning and artificial intelligence methods. As a tool, data science empowers us to obtain insights that could lead to life-changing advancements. Uneven use of advanced data science tools across the world could therefore lead to inequalities in progress and in access to medical interventions.
In this article, we explore the question: how quickly are machine learning and artificial intelligence algorithms adopted in malaria, a disease that primarily affects the poor, compared with cancer, a disease that predominantly affects those in wealthy nations? Very slowly. According to our estimates, it takes an average of 11 years for a data analysis approach in cancer research to diffuse to malaria research (Table 1). We share this article on Medium instead of a biomedical journal or pre-print server to reach a broader community of data scientists, including those without a biomedical background, and to elicit collaborations that enhance the diffusion of knowledge and technologies across domains. We provide code in our MachineLearning4Malaria repo on GitHub to support the central claims in this article, obtain open critical review and invite feedback for future peer-reviewed research in this area.
To estimate the knowledge diffusion of various machine learning algorithms in cancer vs malaria, we queried the public repository of biomedical literature hosted by the US National Library of Medicine (PubMed) to identify all papers that mention the standard machine learning algorithms when presenting research on malaria or cancer in their abstracts. A summary of the results is presented in Table 1. For comparison, we also explored how major biomedical innovations such as DNA sequencing diffuse to cancer vs malaria (Table 2).
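The estimate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the MachineLearning4Malaria repo: the function names are hypothetical, the `[Title/Abstract]` field tag is one plausible way to restrict a PubMed search to abstracts, and the lag is assumed to be the difference between the earliest publication year found for each disease.

```python
from typing import Iterable


def build_pubmed_query(method: str, disease: str) -> str:
    """Build a PubMed search term for papers mentioning both a method and a disease.

    Assumption: the [Title/Abstract] field tag is used to limit matches
    to titles and abstracts; the original analysis may have used a
    different query.
    """
    return f'"{method}"[Title/Abstract] AND "{disease}"[Title/Abstract]'


def first_year(publication_years: Iterable[int]) -> int:
    """Return the earliest publication year among the matching papers."""
    return min(publication_years)


def diffusion_lag(cancer_years: Iterable[int], malaria_years: Iterable[int]) -> int:
    """Years between a method's first cancer paper and its first malaria paper.

    A positive value means the method appeared in cancer research first.
    """
    return first_year(malaria_years) - first_year(cancer_years)


if __name__ == "__main__":
    print(build_pubmed_query("random forest", "malaria"))
    # Made-up years for illustration only, not results from the article.
    print(diffusion_lag([1995, 1990, 2003], [2004, 2001]))
```

The publication years would come from running each query against PubMed, for example through the NCBI E-utilities, and collecting the year of every matching record.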
It appears that machine learning in general took the longest to reach malaria research: its first appearance in malaria publications came 18 years after its first appearance in cancer studies. K-nearest neighbor (KNN) took a similar amount of time. Notably, convolutional neural networks (CNNs), which have emerged in the last decade, show only a 5-year time lag from cancer to malaria. The increasing pace at which information flows today, accompanied by open sharing of data and code, has probably played a key role in this.
Simple statistical approaches such as linear regression are the most commonly used data analysis methods across both cancer and malaria. Basic data analysis methods like linear and logistic regression are more widely adopted by African researchers than the advanced tools. Linear and logistic regression are widely used for identifying relationships between multiple factors and an outcome that is continuous or categorical, respectively. The use of these approaches is, however, declining (Figure 1, additional figures on MachineLearning4Malaria). In malaria, the use of linear regression peaked in 2015 and has been declining as newer machine learning approaches diffuse into the field.
Machine learning technologies have gained traction in biomedical research, driven by the increase in biological data generated and compounded by the advent of high-throughput sequencing and microarray technologies. Machine learning enables scientists to derive more value from complex data and to combine a wide array of data types, especially with the maturity of deep learning and unsupervised learning algorithms. The use of some machine learning algorithms that have been in the field for a longer time is declining. For example, the use of support vector machines and random forests in malaria publications peaked in 2015 and has been declining ever since. Neural networks are gaining popularity.
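One simple way to locate a peak like the ones described above is to count matching publications per year and take the year with the highest count. The sketch below assumes raw yearly counts and hypothetical function names; the figures in the article may have been produced differently, for example with counts normalized by the total number of papers per year.

```python
from typing import Mapping


def peak_year(counts_by_year: Mapping[int, int]) -> int:
    """Return the year with the most publications (earliest year on ties)."""
    return min(counts_by_year, key=lambda year: (-counts_by_year[year], year))


def is_declining_since_peak(counts_by_year: Mapping[int, int]) -> bool:
    """True if counts never rise again in the years after the peak.

    Returns False when the peak is the most recent year, because no
    decline has been observed yet.
    """
    peak = peak_year(counts_by_year)
    later = [counts_by_year[y] for y in sorted(counts_by_year) if y > peak]
    previous = [counts_by_year[peak]] + later
    return bool(later) and all(b <= a for a, b in zip(previous, later))


if __name__ == "__main__":
    # Made-up counts for illustration only, not data from the article.
    counts = {2013: 4, 2014: 7, 2015: 9, 2016: 8, 2017: 6}
    print(peak_year(counts), is_declining_since_peak(counts))
```

Raw counts can be misleading when the overall volume of publications grows every year, so dividing by the yearly total of disease-related papers is a reasonable refinement.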
In general, the diffusion of other biomedical technologies from well-studied diseases such as cancer to neglected diseases is expected to be slow. To assess this, we also performed an analysis of publications involving various biomedical technologies in cancer and malaria (Table 2). On average, there is a 7-year delay between the first application of these technologies to cancer and their first application to malaria. Sanger sequencing, one of the first methods for DNA sequencing, took 17 years to feature in a malaria publication after its first application in cancer research. However, we also found that a few widely applied technologies that are relevant to infectious disease detection appeared in malaria before cancer. For example, Enzyme-Linked Immunosorbent Assay (ELISA), a method for detecting antigens using antibodies, was applied in malaria 2 years before its appearance in a cancer publication. Microarray technology had only a 4-year lag in malaria vs cancer research, while the latest sequencing technology (nanopore sequencing) has taken 5 years to show up in a malaria publication.
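The lags quoted in this paragraph can be summarized with a signed convention, where a negative lag marks a technology that reached malaria first. The sketch below uses only the four lags mentioned in the text, so its mean differs from the 7-year average over all of Table 2; the function name is hypothetical, and treating the nanopore figure as a lag relative to cancer is an assumption.

```python
from typing import List, Mapping, Tuple

# Lags in years (first malaria paper minus first cancer paper), taken from
# the text above. This is a subset of Table 2, not the full table.
LAGS_IN_YEARS = {
    "Sanger sequencing": 17,
    "ELISA": -2,  # negative: applied in malaria before cancer
    "Microarray": 4,
    "Nanopore sequencing": 5,  # assumed to be measured relative to cancer
}


def summarize_lags(lags: Mapping[str, int]) -> Tuple[float, List[str]]:
    """Return the mean lag and the technologies that reached malaria first."""
    mean_lag = sum(lags.values()) / len(lags)
    malaria_first = sorted(name for name, lag in lags.items() if lag < 0)
    return mean_lag, malaria_first


if __name__ == "__main__":
    mean_lag, malaria_first = summarize_lags(LAGS_IN_YEARS)
    print(f"Mean lag over this subset: {mean_lag:.1f} years")
    print("Appeared in malaria first:", malaria_first)
```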
Malaria remains a burdensome disease plaguing the African continent. It is a disease that primarily afflicts the poor. In 2017, about 92% of the 219 million malaria cases were reported in Africa (WHO, World Malaria Report 2018). In this article, we explore the diffusion of technologies, specifically machine learning, to malaria research, using cancer as a reference. Cancer is considered a disease of the rich: 36% of the 18 million new cases in 2018 were reported in Europe.
As a disease of the poor, malaria receives significantly less research funding than cancer. WHO notes that achieving the SDG targets, including a 40% reduction in malaria incidence by 2020, requires about USD 4.4 billion, against the USD 3.1 billion invested in 2017, a deficit of USD 1.3 billion. Also, a majority of malaria researchers come from Africa, a continent which still struggles with funding, access to technology, and interdisciplinary collaboration. Without multidisciplinary collaboration, techniques from computer science take longer to be adopted. Most researchers in Africa still apply traditional technologies to malaria research.
Generating data is costly. African research lacks the human, financial and technological resources to generate the data. Compounded by the lack of collaboration, this contributes to the slow diffusion of machine learning technologies to malaria research. Because malaria burdens the African continent, Africa needs to take the lead in finding solutions to its challenges.
We also explored the question: who is driving the adoption of machine learning technologies in malaria research? To do this, we examined the affiliation of the first author. Let's use papers mentioning machine learning and malaria as an example. Only eight of the 73 papers are from an African country; the majority are affiliated with institutions in the USA, India and Australia. African researchers are not driving the adoption of machine learning algorithms in malaria research: this needs to change.
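The affiliation analysis can be sketched as a simple tally of first-author countries. This is an illustration under assumptions, not the authors' code: the function name is hypothetical, the set of African countries is deliberately partial, and the country of each first author is assumed to have been extracted from the affiliation string already.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Hypothetical, partial set of African countries for illustration;
# a real analysis would need the full list.
AFRICAN_COUNTRIES = {"Kenya", "Nigeria", "South Africa", "Ghana", "Uganda"}


def african_share(
    first_author_countries: Iterable[str],
    african: Set[str] = frozenset(AFRICAN_COUNTRIES),
) -> Tuple[int, int, float]:
    """Return (African-led papers, total papers, share of African-led papers)."""
    tally = Counter(first_author_countries)
    total = sum(tally.values())
    african_count = sum(n for country, n in tally.items() if country in african)
    share = african_count / total if total else 0.0
    return african_count, total, share


if __name__ == "__main__":
    # Synthetic list that reproduces the 8-of-73 split reported above;
    # the specific countries are placeholders.
    countries = ["Kenya"] * 8 + ["USA"] * 65
    count, total, share = african_share(countries)
    print(f"{count} of {total} papers ({share:.1%}) have an African first author")
```

With the figures reported above, eight of 73 papers corresponds to roughly 11% of the machine learning and malaria literature.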
To increase the adoption of machine learning algorithms in malaria research, these algorithms and technologies need to be rapidly adopted and developed by the researchers conducting malaria research: African researchers. Researchers use what is in their toolbox to address questions of interest, and the majority of African researchers rely on molecular techniques. First, therefore, we need to build machine learning capacity among African researchers, especially the young, through hackathons, workshops and curriculum change. Second, we need to encourage interdisciplinary collaborations in Africa and other parts of the world, stop working in silos, and make biomedical research attractive to computer and data scientists.
Adopting open science in biomedical research is one approach to catalyze collaborations between biomedical researchers and computer scientists. Through open science seminars, workshops and a hackathon in Nairobi, Kenya, which equipped local researchers with open science tools, we have observed demand for training in data science skills and in how to apply them to biomedical research. This April, we challenged data scientists at the Deep Learning Indaba X in Durban, South Africa to develop deep learning models for classifying zoonotic and non-zoonotic viruses based on viral genome sequences. In 2016, we organized the DREAM of Malaria Hackathon at the IBM Research Africa lab in Johannesburg, South Africa, bringing together modelers from various countries to assess the utility of genomic datasets in predicting emerging drug resistance to artemisinin, a key anti-malarial drug. We are extending these efforts to an open international data science challenge, the Malaria DREAM challenge, launched on April 30th 2019 to invite computational models for predicting emerging artemisinin drug resistance using genomic data from malaria parasites. We would like to make a special appeal for the participation of data scientists in Africa and will hold a local event on 3rd June 2019 at the IBM Research Africa lab in Nairobi, Kenya to enhance their participation. Collectively, these efforts are increasing the integration of machine learning and artificial intelligence technologies into research on diseases of the poor. More needs to be done. We can only collaborate if we and our collaborators are equipped with open science and collaboration tools.
Caleb K. Kibet is a bioinformatician at the International Center for Insect Physiology and Ecology (ICIPE) in Nairobi, Kenya. He is also the founder of Open Science Kenya. Twitter- @Calkibet
Geoffrey H. Siwo is a research assistant professor at the Center for Research Computing and Eck Institute for Global Health, University of Notre Dame, IN, USA. Twitter- @gsiwo