Sponsor: L'OREAL
Context
The rapid development of DNA sequencing technology has driven research related to genome assembly, revolutionizing the way we understand the microbiome and its intricate relationship with our planet and our lives, unlocking the secrets of metabolic pathways, leading to exciting biotechnological applications, and unraveling the connection between our microbiome and the health of our skin and scalp.
However, the living microbiome presents a significant scientific challenge due to its vast diversity, constantly fluctuating abundance, and rapid adaptation to environmental changes. Much of this microbial world remains shrouded in mystery, as current databases like RefSeq likely cover less than 5.3% of all species [1]. Moreover, current sequencing technologies generate massive amounts of short DNA fragments from these microbiomes, providing only a fragmented view of their genomes. Here, de novo assembly emerges as a crucial technique and a fundamental step in data analysis and the discovery of new species, bridging the gap left by fragmented sequencing data. By accurately reconstructing complete genomes from these sequences, de novo assembly computational method allows us to unravel the complexities of microbial diversity, functions, and interactions.
Traditional algorithms [2] struggle to keep up with this data deluge due to limitations in efficiency, scalability, and complexity. De novo assembly of genomes can take weeks on even the most powerful supercomputers a significant obstacle for time-sensitive applications like clinical diagnostics. To overcome these limitations, researchers are actively exploring innovative solutions, including those utilizing quantum and quantum-inspired computing [3,4,5,6]. These cutting-edge approaches hold immense potential to accelerate de novo assembly and unlock a deeper understanding of the complex microbial world, ultimately leading to breakthroughs in various scientific fields.
Objectives
The project's primary goal is to investigate the scalability and computational efficiency of quantum informatics algorithms designed for de novo genome assembly from short DNA fragments (reads). Genome assembly can be represented as finding the Eulerian path within a graph where nodes represent reads and edges connect reads with maximal overlap. Therefore, the project will explore the potential of quantum optimization algorithms, running on either quantum computers or emulators, leveraging a Traveling Salesman Problem (TSP) formalism to identify the Eulerian path within L'Oréal's provided data. The algorithmic performance, efficiency and scalability developed in the project will be benchmarked against L'Oréal's provided state-of-the-art (SOTA) benchmarks, spanning various sizes and complexities, thus elucidating the advantages and the pecifications of the quantum and HPC requirements for the de novo assembly across different biological system sizes.
References
[1] Louca, S., Mazel, F., Doebeli, M., Parfrey, L.W.: A census-based estimate of earth’s bacterial and archaeal diversity. PLoS biology 17(2), 3000106 (2019)
[2] Yue M. et al. Genome sequence assembly algorithms and misassembly identification methods. Molecular Biology Reports (2022) 49:11133–11148
[3] Boev, A.et al. Genome assembly using quantum and quantum-inspired annealing. Scientific Reports 11(1), 13183 (2021)
[4] Sarkar A. et al.: Quantum accelerated de novo dna sequence reconstruction. Plos one 16(4), 0249850 (2021)
[5] Nałęcz-Charkiewicz, K. et al. Algorithm for dna sequence assembly by quantum annealing. BMC bioinformatics 23(1), 122 (2022)
[6] Varsamis, G. et al.: Quantum algorithm for de novo dna sequence assembly based on quantum walks on graphs. Biosystems 233, 105037 (2023)