OPTIMA

OPTIMA: Sensitive and accurate whole-genome alignment of error-prone genomic
maps by combinatorial indexing and technology-agnostic statistical analysis

Keywords: Optical mapping, genomic mapping, glocal alignment, overlap alignment,
map-to-sequence alignment


Background: Resolution of complex repeat structures and rearrangements in the
assembly and analysis of large eukaryotic genomes is often aided by a combination
of high-throughput sequencing and genome mapping technologies (e.g. optical
restriction mapping). In particular, mapping technologies can generate sparse
maps of large DNA fragments (150 kbp-2 Mbp) and thus provide a unique source
of information for disambiguating complex rearrangements in cancer genomes.
Despite their utility, combining high-throughput sequencing and mapping
technologies has been challenging due to the lack of efficient and sensitive map
alignment algorithms for robustly aligning error-prone maps to sequences.


Methods: Here, we introduce a novel seed-and-extend glocal alignment method,
called OPTIMA (with a sliding-window extension for overlap alignment, called
OPTIMA-Overlap), which is the first to create indexes for continuous-valued
mapping data while accounting for mapping errors. We also
present a novel statistical model, agnostic to technology-dependent error rates,
for conservatively evaluating the significance of alignments without relying on
expensive permutation-based tests.
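
To make the idea of indexing continuous-valued data concrete, here is a minimal
sketch of a seed lookup over fragment sizes. It is not OPTIMA's implementation:
the class name SeedIndexSketch, the fixed seed length, and the single relative
sizing tolerance are all assumptions made for illustration, and the sketch
ignores missing and false cut sites, which a real map aligner has to model.
Seeds of consecutive reference fragment sizes are sorted by their first
fragment, so a query seed can be matched by a binary search followed by a
tolerance check on the remaining fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: a sorted seed index over continuous fragment sizes (not OPTIMA's code). */
final class SeedIndexSketch {
    private final double[] ref;     // reference fragment sizes, e.g. in kbp
    private final Integer[] starts; // seed start positions, sorted by first fragment size
    private final int seedLen;      // number of consecutive fragments per seed (assumed fixed)

    SeedIndexSketch(double[] reference, int seedLen) {
        if (seedLen < 1) throw new IllegalArgumentException("seedLen must be >= 1");
        final double[] sizes = reference.clone();
        this.ref = sizes;
        this.seedLen = seedLen;
        int n = Math.max(0, sizes.length - seedLen + 1);
        this.starts = new Integer[n];
        for (int i = 0; i < n; i++) starts[i] = i;
        // Sort seed start positions by the size of the first fragment in each seed.
        Arrays.sort(starts, new Comparator<Integer>() {
            @Override public int compare(Integer a, Integer b) {
                return Double.compare(sizes[a], sizes[b]);
            }
        });
    }

    /** Returns reference start positions whose seed matches the query within relative tolerance tol. */
    List<Integer> lookup(double[] query, double tol) {
        if (query.length != seedLen) throw new IllegalArgumentException("query length must equal seedLen");
        double lo = query[0] * (1 - tol);
        double hi = query[0] * (1 + tol);
        // Binary search for the first seed whose first fragment is >= lo.
        int left = 0, right = starts.length;
        while (left < right) {
            int mid = (left + right) >>> 1;
            if (ref[starts[mid]] < lo) left = mid + 1; else right = mid;
        }
        List<Integer> hits = new ArrayList<Integer>();
        // Scan candidates in [lo, hi] and verify the remaining fragments of each seed.
        for (int k = left; k < starts.length && ref[starts[k]] <= hi; k++) {
            int s = starts[k];
            boolean ok = true;
            for (int j = 1; j < seedLen && ok; j++) {
                ok = Math.abs(ref[s + j] - query[j]) <= tol * query[j];
            }
            if (ok) hits.add(s);
        }
        Collections.sort(hits); // report hits in reference order
        return hits;
    }

    public static void main(String[] args) {
        double[] reference = {12.0, 30.5, 8.2, 12.4, 29.8, 8.0, 45.0};
        SeedIndexSketch index = new SeedIndexSketch(reference, 2);
        System.out.println(index.lookup(new double[] {12.2, 30.0}, 0.05)); // prints [0, 3]
    }
}
```

In a seed-and-extend scheme, each hit returned by lookup would then be extended
into a full alignment and scored; that extension and the significance
evaluation are outside the scope of this sketch.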


Results: We show that OPTIMA and OPTIMA-Overlap outperform other
state-of-the-art approaches in sensitivity (1.6-2x improvement) while
simultaneously being more efficient (170-200%) and precise in their alignments
(nearly 99% precision). These advantages are independent of the quality of the
data, suggesting that our indexing approach and statistical evaluation are robust
and provide improved sensitivity while guaranteeing high precision.


Software

The source code is freely available on GitHub: OPTIMA v.f-1.3.

OPTIMA depends on two libraries: commons-math3-3.2.jar and cern.jar.

Running sample (requires Java JDK 7+ only):

Datasets

Benchmarking and real datasets used in the study are available in our
data repository.

Snapshots of the code and of the benchmarking and real datasets are also
available from the GigaScience GigaDB database.

Feedback

For questions and suggestions about the project, please contact Davide Verzotto
or Niranjan Nagarajan.

Competing interests

This work was partly supported under a research collaboration agreement with
Sciencewerke Pte. Ltd., the Singapore distributor for OpGen Inc. No employees
of Sciencewerke or OpGen played a role in the work described here.
Davide Verzotto and Niranjan Nagarajan are inventors on a patent application
related to this work.

References

Please cite the following articles and resources:

Davide Verzotto*, Audrey S.M. Teo, Axel M. Hillmer, Niranjan Nagarajan*:
"OPTIMA: Sensitive and accurate whole-genome alignment of
error-prone genomic maps by combinatorial indexing and
technology-agnostic statistical analysis."
GigaScience 2016, 5:2. doi:10.1186/s13742-016-0110-0

Audrey S.M. Teo, Davide Verzotto, Fei Yao, Niranjan Nagarajan, Axel M. Hillmer*:
"Single-molecule optical genome mapping of a human HapMap and a
colorectal cancer cell line."
GigaScience 2015, 4:65. doi:10.1186/s13742-015-0106-1

Davide Verzotto*, Audrey S.M. Teo, Axel M. Hillmer, Niranjan Nagarajan*:
"Index-based map-to-sequence alignment in large eukaryotic genomes."
In the Proceedings of the Fifth RECOMB Satellite Workshop on Massively Parallel
Sequencing, RECOMB-Seq 2015, Warsaw (Poland). doi:10.1101/017194

Davide Verzotto*, Niranjan Nagarajan:
"Index-based map-to-sequence alignment in large eukaryotic genomes."
Patent Application 10201502027V. Intellectual Property Office of Singapore 2015.
*75% contribution

Davide Verzotto*, Audrey S.M. Teo, Axel M. Hillmer, Niranjan Nagarajan*:
"Supporting software for OPTIMA, a tool for sensitive and accurate
whole-genome alignment of error-prone genomic maps by combinatorial
indexing and technology-agnostic statistical analysis."
GigaScience Database 2015. doi:10.5524/100165

Audrey S.M. Teo, Davide Verzotto, Fei Yao, Niranjan Nagarajan, Axel M. Hillmer*:
"Supporting single-molecule optical genome mapping data from human
HapMap and colorectal cancer cell lines."
GigaScience Database 2015. doi:10.5524/100182