Single molecule detection of ctmDNA
Why developing early-detection biomarkers for cancer is not a trivial task
First, second and third-generation sequencing
First-generation sequencing (also known as Sanger sequencing, or dideoxy-terminator sequencing) was the technology used for the Human Genome Project, and remains a workhorse and 'gold standard' technique for sequencing relatively small regions of the genome.
Next-generation sequencing (also known as massively parallel sequencing, or sometimes ensemble sequencing) is the technology that began in 2005 with the first Roche/454 GS-20 instrument. It initially increased throughput some 500-fold relative to the existing Sanger / capillary electrophoresis method, and soon scaled to 1000-fold over an Applied Biosystems 3730xl at maximum 96-well throughput.
Third-generation sequencing (also known as single-molecule sequencing) is the technology that made a splash in 2008 at the Marco Island Advances in Genome Biology and Technology Conference. In the years following, Pacific Biosciences and Oxford Nanopore brought single-molecule, third-generation sequencing systems to market, and these have found niche applications in genome assembly for novel organisms.
Starting from a single molecule
In both next-generation sequencing (NGS) and third-generation sequencing, the principle that the process begins with a single library molecule is widely accepted. DNA (or double-stranded complementary DNA reverse-transcribed from RNA) needs to have adapter molecules attached to both ends. Adapters are commonly attached by ligation, fusion PCR, or even transposases. In NGS, a single molecule is amplified to form a cluster; in third-generation systems the adapter-ligated molecule is interrogated directly.
This is a critical point, often glossed over: there are massive inefficiencies from the native sample through to what is read out on the sequencing instrumentation. This can be illustrated through a popular area of research, single-cell analysis.
Why single-cell analysis?
Single-cell analysis has become an increasingly popular topic, certainly on the leading edge of research and applied biology. The field of pre-implantation genetic diagnostics (PGD) has been working with single cells for many years, typically looking for single-gene mutations via multiplex PCR from embryos at the 8 to 11 cell blastomere stage. Recently the burgeoning study of stem cells and induced pluripotent stem cells has accelerated development of techniques for single cells; a few years ago Fluidigm's C1 single cell analysis system became popular, and it was succeeded by 10X Genomics' Chromium system. In addition, a new focus on Circulating Tumor Cells (CTCs) and cancer stem cells as a form of liquid biopsy has increased attention on single cell analysis even further.
In a single cell, containing only 3.2 pg of DNA, there is only one copy available for analysis. (Lay aside discussion of haploid versus diploid for the moment.) With a single copy, once that copy is lost it cannot be recovered. Thus several techniques have been developed for amplification of the single-cell genome, even before further analysis and enrichment downstream.
Coverage is not even, mapping rates may be only 10%, and, per the reference above, there are many trade-offs between the amount of 'gain' and the amount of noise in the data. As no process is 100% efficient or 100% accurate, errors can and will appear, only to be magnified and then measured precisely downstream; the problem lies not in the measurement but in all the preliminary steps taken before that measurement. Bias in coverage (i.e. the number of regions covered in excess of the mean versus the number of regions undercovered or not covered at all) is a genuine problem with these techniques.
Nonetheless, single-cell analysis is a very important emerging topic in genomics research, a technical area of expertise covering fields as diverse as neurobiology, oncology, aging and genetics.
What is molecular yield?
One concept to introduce is 'molecular yield': the percentage of the molecules available in a given input sample that end up as sequencing reads from a second- or third-generation sequencing platform. An input sample can be as large as several thousand cells' worth of genomic material (a nominal PCR input amount is 50 ng, about 15,600 genomic equivalents).
For sensitive applications such as single-cell analysis, the input amount will be on the order of 3.2 pg of haploid human DNA; for techniques such as Long Fragment Read technology the input amount is diluted even below a single cell genomic equivalent.
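The conversion between input mass and genome equivalents used above can be sketched in a few lines. This is a minimal illustration, assuming 3.2 pg per haploid human genome as stated in the text; the function name and constant are illustrative only.

```python
# Mass of one haploid human genome in picograms, as used in the text.
HAPLOID_GENOME_PG = 3.2


def genome_equivalents(input_ng: float, genome_pg: float = HAPLOID_GENOME_PG) -> float:
    """Convert a DNA input mass in nanograms to haploid genome equivalents."""
    if input_ng < 0 or genome_pg <= 0:
        raise ValueError("input_ng must be >= 0 and genome_pg must be > 0")
    # 1 ng = 1000 pg
    return input_ng * 1000.0 / genome_pg


# A nominal 50 ng PCR input versus a single haploid genome (3.2 pg = 0.0032 ng).
for label, nanograms in [("50 ng input", 50.0), ("3.2 pg input", 0.0032)]:
    print(f"{label}: {genome_equivalents(nanograms):,.0f} genome equivalents")
```

The 50 ng case works out to roughly 15,600 genome equivalents, while the 3.2 pg case is a single copy, which is why losses that are harmless at bulk scale become decisive at the single-cell scale.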
The molecular yield would be the overall percentage of the input molecules appearing as distinct, unique reads as output from the NGS system (again 'NGS' in this context could be 2nd or 3rd-generation sequencing platforms). With all the ligation and PCR steps involved with library construction, the potential for measuring the same molecule over and over again as a sequencing read (called "PCR Duplicates") is a real threat. The use of molecular barcodes (also known as Unique Molecular Identifiers - UMIs - or Unique IDentifiers or UIDs) for individual identification of unique input molecules will greatly assist in removing such PCR duplicate artifacts in the analysis of the sequencing data.
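To make the idea concrete, here is a minimal sketch of collapsing PCR duplicates with molecular barcodes and then computing a molecular yield. It assumes that reads sharing the same UMI and mapped start position come from one input molecule, and it ignores sequencing errors within the UMI, which real pipelines typically have to handle. The `Read` type, the function names, and the input of 10 molecules are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    umi: str    # molecular barcode attached before amplification
    chrom: str  # chromosome the read maps to
    start: int  # mapped start position


def count_unique_molecules(reads: Iterable[Read]) -> Counter:
    """Group reads by (UMI, chrom, start); each key is treated as one input molecule."""
    return Counter((r.umi, r.chrom, r.start) for r in reads)


def molecular_yield(unique_molecules: int, input_molecules: float) -> float:
    """Fraction of input molecules that appear as distinct, unique reads."""
    if input_molecules <= 0:
        raise ValueError("input_molecules must be positive")
    return unique_molecules / input_molecules


reads = [
    Read("ACGTAC", "chr1", 1000),
    Read("ACGTAC", "chr1", 1000),  # PCR duplicate of the first read
    Read("ACGTAC", "chr1", 1000),  # another PCR duplicate
    Read("TTGCAA", "chr1", 1000),  # same position, different UMI: a distinct molecule
    Read("GGATCC", "chr2", 500),
]

families = count_unique_molecules(reads)
unique = len(families)
duplicates = sum(families.values()) - unique

print(f"{len(reads)} reads, {unique} unique molecules, {duplicates} PCR duplicates")
# Hypothetical input of 10 molecules at this locus set.
print(f"molecular yield: {molecular_yield(unique, 10):.0%}")
```

Without the barcode, the first four reads would be indistinguishable by position alone, so either a real second molecule would be discarded as a duplicate or the duplicates would be counted as independent evidence.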
When experiments are run with molecular barcodes, the overall molecular yield from input molecules to output reads can be measured. As mentioned previously, the inefficiencies of intervening molecular manipulation steps are frequently ignored: the efficiency of the ligation step, the efficiency of the reverse transcriptase (if starting from RNA), the efficiency of hybridization-capture (if that technique is used for enrichment), and the efficiency of intermediate purification steps.
These processes are all multiplicative in nature: if the first step is only 50% efficient and the next two steps are 80% and 60% efficient respectively, the overall yield is only 24%. Naturally, for relatively large input amounts (for example, a nominal 50 ng input for Whole Genome Sequencing library preparation, which represents a full 15,600 haploid human genome equivalents), selecting individual molecules for sequencing is not a concern with such a high level of redundancy.
Yet for single-cell sequencing the efficiencies become paramount, so special equipment for sample preparation of single cells (such as 10X Genomics or newcomer MissionBio) is needed for enrichment and library preparation. Otherwise, the molecular yield of single-molecule targets from single cells is so low as to be unusable for library construction and sequencing readout.
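The multiplicative arithmetic can be written out as a short sketch. The step efficiencies are the illustrative 50%, 80% and 60% from the text; the function names are assumptions made for this example.

```python
from math import prod
from typing import Sequence


def overall_yield(step_efficiencies: Sequence[float]) -> float:
    """Multiply per-step efficiencies (each between 0 and 1) into one overall yield."""
    for efficiency in step_efficiencies:
        if not 0.0 <= efficiency <= 1.0:
            raise ValueError(f"efficiency out of range: {efficiency}")
    return prod(step_efficiencies)


def surviving_molecules(input_molecules: float, step_efficiencies: Sequence[float]) -> float:
    """Expected number of input molecules that make it through every step."""
    return input_molecules * overall_yield(step_efficiencies)


steps = [0.50, 0.80, 0.60]
print(f"overall yield: {overall_yield(steps):.0%}")

# Bulk input (about 15,600 genome equivalents) versus a single copy.
for copies in (15_600, 1):
    print(f"{copies:>6} input copies -> {surviving_molecules(copies, steps):,.2f} expected")
```

At bulk scale thousands of copies of every locus still survive, whereas a single input copy has less than a one-in-four chance of surviving all three steps.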
Molecular yield and cell-free DNA
Common cell-free DNA techniques suffer from the same efficiency challenges laid out above. With only 10 ng of cell-free DNA per mL of plasma, this amount represents only about 3,100 haploid genome equivalents of target molecules (10 ng divided by 3.2 pg). Both Guardant Health and Foundation Medicine have reported that the average allele frequency of circulating tumor DNA in the samples they collect is on the order of 0.5%, which represents only about 16 copies of the mutant allele in the sample. At 0.1% circulating tumor DNA, there are only about 3 copies of the target.
Assuming you have an assay that can convert 24% of the molecules into sequencing reads (three steps at 50%, 80% and 60% efficiency), at 0.5% only about 3.8 copies are present as sequencing reads; at 0.1% the roughly 3.1 copies present in the 10 ng sample become about 0.75 sequencing reads.
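The consequence of these small numbers can be sketched as a probability. The example below computes the expected number of mutant reads and, under the added assumption that read counts follow a Poisson distribution with that mean (a common simplification, not something the figures above establish), the chance of seeing at least a given number of mutant reads. It derives genome equivalents directly from 3.2 pg per haploid genome, so its output may differ slightly from rounded figures in the text. All function names are illustrative.

```python
import math

HAPLOID_GENOME_PG = 3.2  # picograms per haploid human genome, as used in the text


def expected_mutant_reads(input_ng: float, allele_fraction: float,
                          molecular_yield: float,
                          genome_pg: float = HAPLOID_GENOME_PG) -> float:
    """Expected unique mutant reads = genome equivalents x allele fraction x yield."""
    genome_equivalents = input_ng * 1000.0 / genome_pg
    return genome_equivalents * allele_fraction * molecular_yield


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean); the Poisson model is an assumption."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


for allele_fraction in (0.005, 0.001):
    mean = expected_mutant_reads(10.0, allele_fraction, 0.24)
    print(f"allele fraction {allele_fraction:.1%}: "
          f"{mean:.2f} expected reads, "
          f"P(>=1 read) = {prob_at_least(1, mean):.2f}, "
          f"P(>=3 reads) = {prob_at_least(3, mean):.2f}")
```

Under this model, a 0.1% variant in 10 ng of input at 24% yield is seen in even a single read only a little more than half the time, before any requirement for multiple supporting reads is applied.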
Remember that molecular yield is not commonly measured or reported. Sophisticated NGS laboratories will use 'library complexity' as a proxy for molecular yield, looking at measures of GC content and unique start sites of the sequencing reads. One vendor, New England Biolabs (NEB), measures the efficiency of adapter ligation, but only relative to other vendors (in figure 3 of this technical note). This is only step 1 of a multi-step library construction process, and adapter-ligation efficiency differs 2x to 5x between NEB and its competitors. The absolute efficiency isn't even measured in this chart.
Another vendor, Swift Biosciences, in this application note on Library Complexity (PDF) calculates a 10-20% molar yield and a 1% complexity conversion.
One way around the low molecular yield problem would be to simply collect more sample. One recent paper used a full 40 mL of plasma in order to analyze methylated cell-free DNA via microarray; however, collecting eight 10 mL blood-draw collection tubes for a single analyte does pose practical collection, processing and storage issues.
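The trade-off between sample volume and molecular yield can be illustrated with a rough calculation. This sketch assumes 10 ng of cell-free DNA per mL of plasma and 3.2 pg per haploid genome, as elsewhere in this post; the target of 10 mutant reads is a hypothetical threshold picked for the example, not a recommended cutoff.

```python
def plasma_ml_needed(target_reads: float, allele_fraction: float,
                     molecular_yield: float,
                     cfdna_ng_per_ml: float = 10.0,
                     genome_pg: float = 3.2) -> float:
    """Plasma volume (mL) needed to expect `target_reads` unique mutant reads."""
    if min(target_reads, allele_fraction, molecular_yield,
           cfdna_ng_per_ml, genome_pg) <= 0:
        raise ValueError("all arguments must be positive")
    # Expected mutant reads obtained from each mL of plasma.
    reads_per_ml = (cfdna_ng_per_ml * 1000.0 / genome_pg) * allele_fraction * molecular_yield
    return target_reads / reads_per_ml


# Hypothetical target: 10 expected mutant reads at a 0.1% allele fraction.
for yield_fraction in (0.24, 0.80):
    volume = plasma_ml_needed(10, 0.001, yield_fraction)
    print(f"molecular yield {yield_fraction:.0%}: about {volume:.1f} mL of plasma")
```

Because the required volume scales inversely with yield, improving the biochemistry reduces the amount of sample needed in direct proportion, without changing anything at the blood draw.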
Singlera's technology is single-molecule sequencing
When you hear the phrase 'single-molecule sequencing' you think of a Pacific Biosciences (now being acquired by Illumina) Sequel, or an Oxford Nanopore MinION. Yet with a high molecular yield (on the order of 80%), Singlera uses 20 ng of cell-free DNA and detects thousands of methylated haplotypes and tens of thousands of CpG sites as unique sequence reads.
The key component is not emulsion equipment nor physical partitioning, but rather straightforward library preparation biochemistry optimized for molecular yield. Capable of analysing circulating tumor methylated DNA (ctmDNA), the Singlera technology is poised to significantly alter the early-detection landscape.
With a rich collection of difficult-to-detect molecules in hand, the ctmDNA haplotypes can then be analyzed using the power of Big Data to effectively and efficiently differentiate between samples that have early cancer signals and those that do not. (This post, "Big Data and Early Detection of Cancer", explores this topic further.)
The consideration of molecular yield has very important implications for assay developers who are looking for subtle signals from a limited amount of input molecules. You cannot interpret something that you are not measuring in the first place. With all the manipulations needed to get from a native cell-free DNA molecule to an analyzed sequencing read, are you currently capturing enough molecules to find what you are looking for?
If you are interested in collaboration opportunities, please contact us.