‘Let’s talk about machine learning vs statistics’

Elizabeth Merrall, consultant statistician, considers what machine learning brings to the healthcare sector.

As the capacity to collect, store and access data has grown over the last few decades, it is no exaggeration to say that data science and machine learning have become buzzwords. As a classically trained statistician working in pharma, I have been cautious but curious about venturing into, and toying with, these ever-expanding and exciting tools.

Machine Learning

Machine learning refers to techniques that learn from data, using examples as opposed to rules, in order to make predictions. They are described very nicely by Isaac Kohane, from Harvard Medical School, in a recent NEJM interview.

Machine Learning Binary Classifier Model

At its simplest, this could be a binary classification problem to identify sick versus healthy individuals, using a logistic regression that includes several explanatory variables, or features. Regression is one of the simpler tools available to the machine learning specialist. The tools range from simple models of this kind to complex neural networks, made up of layers of classifying functions that translate input data into a prediction. Towards the other end of the spectrum, search engines such as Google use machine learning in various ways – probably the most obvious application is the ranking of search results by relevance to the user. That is, based on previous searches and clicking activity, they determine which results are most likely to be clicked in a search. In this vein, machine learning departs from the more traditional hypothesis-driven approach, involving carefully selected variables, towards a more open-ended and data-driven approach, entailing less human interaction.
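To make the simplest case concrete, here is a minimal sketch of a logistic regression classifier for "sick" versus "healthy" records. Everything in it is an assumption for illustration: the data are synthetic, the two features are arbitrary, and the helper names (`fit_logistic`, `predict_proba`, `make_data`) are hypothetical. The model is fitted by plain gradient descent, which is only one of several ways to do it.

```python
import math
import random

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    # Predicted probability of the "sick" class; w[0] is the intercept
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.1, epochs=500):
    # Batch gradient descent on the average log-loss
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def make_data(n, seed=0):
    # Synthetic data (an assumption): two features, shifted upwards for "sick"
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 0 = healthy, 1 = sick
        shift = 1.5 if label else -1.5
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y

X, y = make_data(200)
weights = fit_logistic(X, y)
predictions = [1 if predict_proba(weights, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print("weights (intercept first):", [round(v, 2) for v in weights])
print("training accuracy:", round(accuracy, 2))
```

The fitted weights play the same role as regression coefficients in a classical analysis. Note that the accuracy printed here is measured on the same data used for fitting, so it says little about how the model would perform on new patients.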


The rise in the use of machine learning is not without its concerns. At the American Association for the Advancement of Science in Washington, Dr Genevera Allen, from Rice University in Houston, spoke of ‘a reproducibility crisis that has been growing for two decades and has come about because experiments are not designed well enough to ensure that the scientists don’t fool themselves and see what they want to see in the results.’

On the other hand, the reproducibility of scientific findings is not a new issue (cf. Ioannidis, 2005), and well-defined analysis and study design, consideration of biases in the data collection and critical review of results are as applicable as ever. In machine learning, the model is additionally evaluated by how accurately it predicts. Hence, it is good practice to use one part of the data as a training set to create the model that relates explanatory variables, or features, to subsequent outcomes, or labels. We can then see how well the model predicts, using the remainder of the data as a test set. This can be repeated, training and testing on different splits of the data (cross-validation), to check that the model performs consistently and is not overfitting.
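The repeated train-and-test procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a recommended implementation: the data are synthetic, the model is a deliberately simple nearest-centroid classifier standing in for whatever model is actually being evaluated, and the function names (`fit_centroids`, `k_fold_scores` and so on) are hypothetical.

```python
import random
import statistics

def make_data(n, seed=0):
    # Synthetic records (an assumption): ([feature1, feature2], label)
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2  # 0 = healthy, 1 = sick
        centre = 1.0 if label else -1.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows

def fit_centroids(train):
    # "Training": the mean feature vector of each class
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    # Predict the class whose centroid is closest (squared Euclidean distance)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

def k_fold_scores(rows, k=5, seed=1):
    # Shuffle once, then let each of the k folds act as the test set in turn
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_centroids(train), test))
    return scores

data = make_data(300)
scores = k_fold_scores(data, k=5)
print("accuracy per fold:", [round(s, 2) for s in scores])
print("mean:", round(statistics.fmean(scores), 2),
      "sd:", round(statistics.stdev(scores), 2))
```

Each record is used for testing exactly once and is never part of the training set for the fold that tests it. Accuracies that vary widely between folds, or that fall well below the accuracy seen on the training data, would be a warning sign of overfitting.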

Real World Use Case

Moreover, machine learning can contribute in ways that more classical statistical analytics cannot, for example in the processing of images and text, and in the mining of large datasets. For the latter, one convincing example was presented by Roche at Medidata’s recent Clinical Innovation day at Lundbeck, Copenhagen: the mechanisms behind the side effects of one of their compounds were better understood and, as a result, treated, through the mining of relevant publicly available databases. More generally, in a recent article in the New England Journal of Medicine, Alvin Rajkomar and Jeff Dean of Google, together with Isaac Kohane, talk through a series of interesting possibilities for machine learning to enhance the work of clinicians with respect to prognoses, diagnoses, treatment decisions, the compilation of electronic health records and more efficient use of medical expertise.

Final Thoughts

In conclusion, while I have been skeptical, with these data-driven approaches going against my usual way of thinking, I am now a firm believer that machine learning has its own role to play in medical applications. As a medical tool, it is still in its infancy and faces challenges similar to those already experienced by statisticians, in establishing best practices and ways of working with stakeholders. Of course, there will be pitfalls along the way, and machine learning practitioners can benefit from exchanging experiences with statisticians to mitigate some of these. All in all, I’m excited for this field and looking forward to seeing these methods strengthen and optimize the healthcare of our future.

About the author:

Elizabeth is part of a team of statisticians at S-cubed that provide statistical expertise and support to pharmaceutical and biotech clients, throughout all phases of development. If you’re interested in discussing your applications of machine learning in the healthcare sector, or statistical aspects of the design, analysis and reporting of your clinical studies or submissions, please reach out to us through the website, email info @ s-cubed-global.com or call us on +45 31 45 29 16