Validating Microorganism DNA with AI: A Solution or a New Challenge?

Wed, 03 Jun 2026, 18:33 WIB | Source: DETIK | Technology

Imagine a vast library containing millions of books, but instead of text, it holds the genetic codes of billions of microorganisms. This is the reality of modern scientific microbial collections: a massive repository of DNA data that serves as the foundation for global health, environmental, and biotechnology research. However, much like a library where books can be damaged or misplaced, genetic data is vulnerable to errors, ranging from sample contamination and sequencing noise to inconsistent data formats.

Traditionally, the verification and validation of microbial scientific collections have been performed manually or with tools such as FastQC and NanoPlot. These methods work, but they struggle to keep up with the exploding volume of data produced by next-generation sequencing (NGS), where a single research project can generate hundreds of gigabytes of genomic data in a matter of days.

This is where Artificial Intelligence (AI) comes in. AI, specifically machine learning and deep learning, can process, evaluate, and validate DNA sequencing data at a scale that is impossible for humans. Algorithms can automatically handle missing values, detect anomalies in real time, and even predict potential errors before they propagate through the analysis pipeline. This is not merely about efficiency; it is about the reliability of science itself.

AI-based quality control in microbial DNA sequencing operates across three layers: raw data quality control, alignment quality control (the alignment of base sequences), and variant calling quality control. The three stages back each other up: if data slips through one filter with a hidden defect, the next layer can still catch it, resulting in much cleaner and more reliable data for downstream analysis.

One of the most promising innovations is a pipeline called QC-Blind, a quality control system designed specifically to handle contamination in DNA sequencing projects. Its distinguishing feature is that it does not require a reference genome for the contaminants, which allows it to filter out unknown contaminants while preserving the genomic information of the target species. Tests on a range of datasets, including both computer simulations and real laboratory samples, show that QC-Blind filters contaminants with high specificity and accuracy. This is particularly relevant for scientific collections that deal with complex environmental samples, where the identity of contaminants is not always known beforehand.
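
As a rough illustration of the first of the three layers described above, raw data quality control, here is a minimal Python sketch that filters sequencing reads by their mean Phred quality score. It is not taken from FastQC, NanoPlot, or QC-Blind. The function names, the threshold of 20, and the sample reads are all assumptions made for illustration, and the quality strings are assumed to use the common Phred+33 encoding found in FASTQ files.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality: str) -> float:
    """Arithmetic mean of the Phred scores; 0.0 for an empty string."""
    scores = phred_scores(quality)
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(records, min_mean_quality: float = 20.0):
    """Keep (name, sequence, quality) records whose mean quality meets the threshold.

    The default threshold of 20 is an illustrative assumption, not a standard.
    """
    return [
        (name, seq, qual)
        for name, seq, qual in records
        if mean_quality(qual) >= min_mean_quality
    ]


# Hypothetical reads: (name, bases, quality string)
reads = [
    ("read_1", "ACGTACGT", "IIIIIIII"),  # 'I' decodes to Phred 40
    ("read_2", "ACGTACGT", "########"),  # '#' decodes to Phred 2
    ("read_3", "ACGT", "5555"),          # '5' decodes to Phred 20
]

kept = filter_reads(reads)
print([name for name, _, _ in kept])
```

Averaging Phred scores directly is a simplification, since the scores are logarithmic. A production pipeline would normally apply more careful statistics and pass the surviving reads on to the alignment and variant calling checks.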

Another crucial aspect is the validation of the AI models themselves. Researchers divide datasets into training, testing, and validation sets, typically in an 80:10:10 ratio, to ensure the model does not simply
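
The 80:10:10 division into training, testing, and validation sets mentioned above can be sketched in a few lines of Python. This is only one plausible way to do it, not a description of how any particular research group splits its data. The function name `split_dataset`, the fixed random seed, and the use of integer percentages are assumptions made for illustration.

```python
import random


def split_dataset(items, percents=(80, 10, 10), seed=42):
    """Shuffle items and split them into training, testing, and validation sets.

    percents are whole-number percentages that must add up to 100.
    The fixed seed (an assumption) makes the split reproducible.
    """
    if sum(percents) != 100:
        raise ValueError("percents must add up to 100")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_test = n * percents[1] // 100

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    # Validation takes whatever remains, so no sample is dropped by rounding.
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Hypothetical dataset of 100 sample identifiers
samples = [f"sample_{i}" for i in range(100)]
train, test, validation = split_dataset(samples)
print(len(train), len(test), len(validation))
```

Shuffling before splitting keeps the three sets from inheriting any ordering in the original data, and keeping the sets disjoint means the model is always evaluated on samples it did not see during training.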
