Big Data and Early Detection of Cancer
Big Data is great when there is signal, and not so great when there isn't.
Big Data is a popular topic in this age of machine learning and artificial intelligence. There is plenty of activity in the healthcare realm, not the least of which is in cancer research.
An example of big data's failure: IBM Watson Health
Last spring, at the American Association for Cancer Research conference, there was an interesting session chaired by Dr. Lynda Chin entitled "Artificial Intelligence-enabled Cancer Care and Research: Potential and Challenges". In that session she presented a talk entitled "Big data and AI: developing cognitive applications in medicine". It was a very popular talk, given that she oversaw a large pilot effort to implement Watson Health, IBM's multi-billion-dollar investment, in MD Anderson's healthcare system.
Four years of work and $62 million later, this effort failed. Dr. Chin highlighted many lessons in her talk, which was surprisingly positive in tone. (The news about the cost, and about spending sums of that size outside the purview of the MD Anderson administration, would come out later.)
She pointed out that with the training and test set methodology used to implement Watson Health, the quality of the underlying data is paramount. A SME (subject matter expert; in this case, an oncologist with a decade or more of experience) provides excellent data for this exercise, but nonetheless such 'real world data is messy and noisy'.
At the end of this exercise, and many others in smaller settings, IBM Watson Health is in retreat, recognized as a multi-billion-dollar investment that hasn't paid off (yet) for IBM.
Big Data, One Main Lesson
Some practical lessons in the data science realm outlined here are informative: data quality matters, data integration is key, lean and continuous delivery, and regular model training and performance monitoring.
Without the first lesson, the paramount importance of data quality, there is no data to integrate, no data to deliver, and no model to improve. This may seem patently obvious, yet it is easily lost: looking for a signal in a sea of noise is a fool's errand.
And the news coming out of MD Anderson at that time was not favorable from a project management point of view: "Hospital Stumbles in Bid to Teach a Computer to Treat Cancer" reads one Wall Street Journal headline. Regarding patient care, "IBM Watson Reportedly Recommended Cancer Treatments That Were 'Unsafe and Incorrect'" reads a recent Gizmodo headline reporting on a meeting that took place later in the summer of 2017.
Perhaps machine learning's potential to apply state-of-the-art cancer care cannot be realized at this time, given the limitations of the kinds of data used to train and test the learning model. Naturally, EHR (electronic health record) systems that could be integrated with the Watson software would have been important. (This was not possible in the MD Anderson implementation, as the audit later revealed.)
Dr. Chin made an interesting comment in her presentation about such sophisticated systems: "AI systems are taught, not programmed, by examples of 'right decisions'; the ground truth needs to be true." Any inconsistency in the signal, or noise that drowns out the signal, will have a profound effect on the end result.
Big Data and early cancer detection
With all the potential of Big Data, machine learning and artificial intelligence, several efforts are underway. For example, GRAIL is enrolling 15,000 participants for its prospective Circulating Cell-free Genome Atlas study, which started in August of 2016 and continues through 2019. An even larger study for the early detection of breast cancer, with 120,000 participants, is also underway: The STRIVE Study: Breast Cancer Screening Cohort for the Development of Assays for Early Cancer Detection.
Another early-detection firm, Freenome, is enrolling 3,000 participants in a colorectal cancer study called the Specimen Collection Study for Cancer, which started earlier this year and will conclude next year.
The idea behind both these companies is the same: to "identify patterns of cell-free biomarkers in the blood to detect cancer early." In GRAIL's case, they are looking at cell-free circulating tumor DNA (targeted and whole-genome), and a variety of mutation types (single nucleotide variants, insertion-deletion mutations, copy number variants and structural variants). They are also looking at whole-genome methylome sequencing and have reported at major conferences, namely AACR in April and ASCO in June, specific numbers for sensitivity and specificity in their ongoing CCGA study.
These two press releases on preliminary data share similar results: at a high confidence interval (95% in the April data, 98% in the June set), the targeted assay detected early-stage lung cancers (Stage I-IIIA) at a rate of 50% or 51%. Their whole-methylome assay (also called whole-genome bisulfite) showed early-stage lung cancer detection rates of 65% and 41%, respectively.
Is there enough cancer signal?
Whole-genome bisulfite sequencing is a very, very large task. As a matter of technical detail, any unmethylated cytosine residue is chemically converted by the bisulfite treatment into uracil, which is then sequenced as a thymine residue; methylated cytosines are protected and still read as cytosine. Thus a C>T transition appears in the sequencing data, and a four-base genome effectively becomes a three-base genome everywhere except at methylated CpG sites.
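To make the conversion concrete, here is a minimal in-silico sketch of what bisulfite treatment does to a read (a hypothetical helper function, not part of any vendor's pipeline): unmethylated cytosines are read out as T, while cytosines at methylated positions stay C.

```python
def bisulfite_read(seq: str, methylated_positions: set) -> str:
    """Simulate the sequence observed after bisulfite conversion.

    Cytosines at indices in `methylated_positions` are protected
    (remain C); all other cytosines are converted to uracil and
    sequenced as T.
    """
    return "".join(
        base if base != "C" or i in methylated_positions else "T"
        for i, base in enumerate(seq)
    )

# The CpG at positions 2-3 is methylated; the C at position 6 is not.
print(bisulfite_read("ATCGAACG", {2}))  # -> ATCGAATG
```

Note how the converted read has lost most of its C content, which is exactly the complexity reduction that makes alignment harder.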
Lowering the complexity from four bases to three is a major headache for alignment algorithms, so much more sequencing is needed to compensate. For example, in 2009, when the first human methylome was sequenced at base-pair resolution, 87 and 91 gigabases of sequence were needed to cover 86% of both strands of the human genome with at least one read. This is an expensive and difficult undertaking, both in terms of bioinformatics and of raw sequence data. And what can you do with single-read coverage of a signal that is analog, like methylation status? (At a given CpG site, the methylation level may vary from 100% methylated all the way down to 0%.)
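The "analog" nature of the signal comes down to simple arithmetic: the methylation level at a CpG site is the fraction of reads covering that site that carry a methylated (protected) C. A minimal sketch (hypothetical function, assuming read counts have already been tallied) shows why single-read coverage is so uninformative:

```python
def methylation_level(methylated_reads: int, total_reads: int) -> float:
    """Fraction of reads carrying a methylated C at one CpG site."""
    if total_reads == 0:
        raise ValueError("no coverage at this site")
    return methylated_reads / total_reads

# With deep coverage, the analog level is resolved:
print(methylation_level(13, 20))  # -> 0.65
# At one-read coverage, the only possible answers are 0.0 or 1.0:
print(methylation_level(1, 1))    # -> 1.0
```

A single read forces a binary answer onto what is really a continuous quantity, which is why base-pair-resolution methylomes demand such deep (and expensive) coverage.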
Given the 'big data' nature of the problem, can enough signal be generated to detect cancer early using whole-genome bisulfite sequencing? Going whole genome is a popular idea, and the tension between whole-genome and targeted exome (or smaller panel) approaches has been ongoing for over ten years in the genomics field. (For more information and context around this debate, here's something I wrote a few years ago on the Thermo Fisher Behind the Bench Blog.)
The simple Singlera approach
Singlera has developed a targeted methylation assay that is efficient and effective in its simplicity. Instead of the broad (and difficult) approach of whole-genome bisulfite sequencing from cell-free DNA, Singlera takes a targeted approach from a minimal amount of sample (20 ng of cell-free DNA, the amount typically found in 2 to 3 mL of plasma), with a lightweight single-day protocol, standard Illumina or Thermo Fisher Ion Torrent NGS (next-generation sequencing), and a fast bioinformatics pipeline.
The use of methylation haplotypes increases the complexity of the analyte while simultaneously simplifying the sorting of the signal. The PanSeer assay's existing detection rate with the Taizhou Longitudinal Cohort samples for early-stage cancers (all types) was above 75%. See this PanSeer Technical Note PDF for information on both the Taizhou cohort and experimental details.
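One way to see why read-level haplotypes simplify the sorting of the signal: a metric along the lines of the methylation haplotype load (MHL) from the methylation haplotype block literature rewards coordinated methylation across adjacent CpG sites on the same read, so that coordinated and scattered methylation with the same average level score differently. The sketch below is a hypothetical illustration of such a metric, not Singlera's actual pipeline; each haplotype is a string of '1' (methylated) / '0' (unmethylated) calls for consecutive CpG sites on one read, and all reads are assumed to cover the same CpG block.

```python
def mhl(haplotypes: list) -> float:
    """Toy methylation haplotype load over reads of a single CpG block.

    For each window length i, MF_i is the fraction of length-i windows
    across all reads that are fully methylated; the score is the
    i-weighted average of the MF_i, so longer fully methylated
    stretches contribute more signal.
    """
    length = len(haplotypes[0])
    num = den = 0.0
    for i in range(1, length + 1):
        windows = fully = 0
        for hap in haplotypes:
            for start in range(len(hap) - i + 1):
                windows += 1
                if hap[start:start + i] == "1" * i:
                    fully += 1
        num += i * (fully / windows)
        den += i
    return num / den

# Same 50% average methylation, very different haplotype structure:
print(mhl(["11", "00"]))            # coordinated -> 0.5
print(round(mhl(["10", "01"]), 3))  # scattered   -> 0.167
```

The two inputs above have identical per-site average methylation, yet the coordinated pattern scores three times higher, which is the kind of discrimination a site-by-site average cannot provide.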
Singlera is currently looking for collaboration partners to further investigate the usefulness of this technology in the early detection of cancer. Contact us today if you are interested.