The microbial world plays a fundamental role in shaping Earth's biosphere, steering global processes such as carbon and nitrogen cycling, soil rejuvenation, and ecological fortification. An overwhelming majority of microbial entities, however, remain unstudied. Metagenomics stands to elucidate this microbial "dark matter" by directly sequencing the microbial community DNA from environmental samples. Yet our ability to explore these metagenomic sequences is limited to establishing their similarity to curated datasets of organisms or genes/proteins. Aside from the difficulties in establishing such similarity, reference-based approaches, by definition, forgo discovery of any entities sufficiently unlike the reference collection.
Language model-based methods offer promising avenues for reference-free analysis of metagenomic reads. We developed REBEAN to identify enzymatic metagenomic reads by predicting their Enzyme Commission (EC) number level 1 classes. REBEAN is a fine-tuned version of REMME, a pretrained DNA foundation language model. By emphasizing function over gene identification, REBEAN is able to label known functions carried both by previously explored genes and by new (orphan) sequences. REBEAN has been shown to identify the functionally relevant parts of a gene, even though it is not explicitly trained to do so, and to increase the annotation coverage of unassembled metagenomic reads relative to alignment-based annotation tools. In addition to REBEAN, this web platform incorporates MeBiPred's metal-binding prediction and mi-faser's alignment-based EC prediction, thus providing a versatile platform for metagenomic data analysis and annotation.
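For readers unfamiliar with EC level 1 classes, the short Python sketch below shows what a label at that level means and how per-read labels might be tallied. It is an illustration only, not REBEAN's code or output format: the function names `ec_level1` and `summarize`, the read identifiers, and the input dictionary are all hypothetical. Only the seven top-level EC class names are taken from the standard enzyme nomenclature.

```python
from collections import Counter

# The seven top-level Enzyme Commission classes (EC level 1).
EC_LEVEL1 = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def ec_level1(ec_number: str) -> int:
    """Return the first field of an EC number such as '3.4.21.4' or 'EC 2.7.-.-'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].lstrip(" :")  # drop an optional 'EC' prefix
    level1 = int(text.split(".")[0])
    if level1 not in EC_LEVEL1:
        raise ValueError(f"unknown EC level 1 class: {ec_number!r}")
    return level1


def summarize(read_labels: dict) -> dict:
    """Count reads per EC level 1 class.

    `read_labels` is a hypothetical mapping of read ID -> EC number string,
    with None for reads that carry no enzymatic label.
    """
    counts = Counter()
    for label in read_labels.values():
        if label is None:
            counts["unlabeled"] += 1
        else:
            counts[EC_LEVEL1[ec_level1(label)]] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up example reads, for illustration only.
    example = {"read_1": "1.1.1.1", "read_2": "3.4.21.4",
               "read_3": None, "read_4": "EC 3.1.-.-"}
    for name, n in summarize(example).items():
        print(f"{name}\t{n}")
```

Collapsing full EC numbers to their first field is also one simple way to compare annotations from tools that report different levels of EC detail, since only the top-level class is retained.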
Citing REBEAN: If you use REBEAN in published research, please cite:
R Prabakaran, Y Bromberg. (2024). Deciphering enzymatic potential in metagenomic reads through DNA language model. bioRxiv 2024.12.10.627786; doi: https://doi.org/10.1101/2024.12.10.627786
Citing MeBiPred: If you use MeBiPred in published research, please cite:
Aptekmann AA, Buongiorno J, Giovanelli D, Ferreiro DU, Bromberg Y. (2021). MeBiPred: A powerful tool to discover metal binding proteins.
Citing mi-faser: If you use mi-faser in published research, please cite:
Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Rühlemann, M. C., Wu, G. H. F.-A., . . . Bromberg, Y. (2017). Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Res. doi:10.1093/nar/gkx1209