"Correcting Errors in PacBio and ONT Assemblies using High Definition Mapping"

February 27, 2019

John S. Oliver, Barrett Bready, Anthony P. Catalano, Jennifer R. Davis, Michael D. Kaiser, Jay M. Sage

De novo whole genome assemblies based on short-read sequencing data are often incomplete and highly fragmented. The development of long-read, single-molecule technologies, such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), was driven by the need for longer read lengths to span repeat regions and complex events. While these technologies have significantly improved assembly, both have high per-read error rates, which result in frequent assembly errors, and neither achieves sufficient read length to observe all genomic structural changes. Scaffolding methods, such as optical mapping or Hi-C, have been used in combination with sequencing technologies to improve assemblies, but they suffer from inherent resolution limitations and high cost. Complete, accurate, and cost-effective genome assembly remains a problem, even for small microbial genomes.

To provide the necessary long-range information while maintaining sufficient resolution to complement sequencing technologies, Nabsys has developed the HD-Mapping™ platform to construct high-resolution whole genome maps. Electronic detection of reads that are hundreds of kilobases in length preserves long-range information while simultaneously achieving unparalleled resolution and accuracy. Single-molecule reads have high resolution and low false-negative and false-positive error rates, resulting in high information content per read. To assess data quality, we present a de novo assembled map containing a known large tandem repeat and show concordance with the well-established reference.

To demonstrate the need for the long-range information provided by Nabsys HD maps for accurate assembly, we show examples of assembly errors generated with PacBio and ONT data from large and small genomes. Errors observed include collapsed repeats, false duplications/insertions, chimeric joins, and incorrect circularization junctions. Alignments between Nabsys assembled maps and sequence assemblies are presented to highlight regions of discrepancy.
