Agenda

09 Feb 2023 11:00

Run-Length Compressed FM-Indexes and Maximal Exact Matches for Pan-Genomics Pattern Matching

Laboratorio ACADIA - edificio ZETA, Campus Scientifico via Torino

Speaker
Nathaniel K. Brown
, Dalhousie University, Canada

Abstract:
The FM-index is a success story of compact data structures, seeing wide-use in DNA alignment and used daily in clinical settings worldwide. Although the FM-index is powerful in reducing space whilst supporting queries efficiently on a few human genomes in a reference text, it cannot handle many at once due to poor scaling. This is of particular concern due to reference bias: using only a few standard references biases DNA alignment towards the ethnic origin of the reference or fails to capture the genetic diversity necessary to identify medical diagnoses. Pan-genomics concerns itself with approaches which can leverage many genomes to avoid these downfalls.

Pangenome graphs are an increasingly popular method to better capture genetic diversity, but we can also attempt to scale FM-indexes to include even thousands of genomes; a technically challenging problem, but one that gives different functionality than pangenome graphs. A recently successful approach has been FM-indexes using run-length compression, which leverages the tendency for repetitive strings (such as DNA) to have many ‘contiguous same character runs’ r within the data structure. Gagie, Navarro, and Prezza showed how we can quickly support locate queries efficiently in O(r)-space, and Nishimoto and Tabei improved this result to answer queries in optimal time given a polylog-sized alphabet.

We showed that this result can be made practical and generalized to run-length compressing permutations; of which the LF-mapping of the FM-Index can be considered. This ‘table-lookup’ approach is beginner friendly introduction to FM-indexes and run-length compression and how in O(r)-space we can support substring extraction through this LF-mapping property in optimal time for runs bounded methods. Although many r-index approaches don’t use table-based LF, computing it efficiently is necessary to support queries such as count and locate. Further, compressing this approach was shown to be comparable to existing O(r)-space methods even without guarantee of optimal time LF.

However, approximal pattern matching is often more important in bioinformatics than exact matching queries. Bannai, Gagie, and I showed how the r-index can compute maximal exact matches (MEMs) to support approximate pattern matching, and Rossi et al. presented a practical method, called MONI, which uses longest common extension (LCE) queries and ‘thresholds. MONI finds MEMs of a pattern P against a text P by first processing the positions of matches and then finding their lengths, but this can be done in one-pass by dropping thresholds for additional LCE queries; useful in online approaches such as targeted nanopore sequencing. However, by reintroducing ‘thresholds’ with an augmentation of additional LCE information at run boundaries, we can greatly speed up MEM-finding while still performing a one-pass approach.

Bio Sketch:
Nathaniel K. Brown is currently completing his Masters degree at Dalhousie University (expected April, 2023) under Travis Gagie while also completing a visitation at Johns Hopkins University under Ben Langmead in Dec. 2022, focusing on practical improvements in compact data structures. His work on table-based LF was presented at SEA ’22, with further work on MEM-finding to be presented at DCC ’23.

Lingua

L'evento si terrà in inglese

Organizzatore

Nicola Prezza

Cerca in agenda