By Vered Smith, Science Writer
Scientists have published a paper studying the evolution of SARS-CoV-2, using artificial intelligence (AI) to analyse over 40,000 genomes.
Diversity within the SARS-CoV-2 population is increasing quickly as the virus evolves. Unsupervised AI learning can be used to mine big data without specific models or presumptions. In this study, scientists used a previously established batch-learning self-organizing map (BLSOM) to compare the oligonucleotide composition of SARS-CoV-2 genomes and to cluster the genomes based on similarity.
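As a rough illustration of what comparing oligonucleotide composition involves, the sketch below counts every overlapping k-mer in a sequence and turns the counts into a fixed-length frequency vector. The function names and the toy sequences are invented for this example, and this is not the code used in the study.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers, skipping windows that contain non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def composition_vector(sequence, k):
    """Relative k-mer frequencies, listed in a fixed lexicographic order."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy sequences invented for illustration (not real viral genomes).
toy_a = "ACGTACGTAC"
toy_b = "ACGTTCGTAC"
vec_a = composition_vector(toy_a, 2)
vec_b = composition_vector(toy_b, 2)
```

Because every genome is reduced to a vector of the same length, genomes can be compared directly without first aligning them. Enumerating all 4^k possible k-mers is only practical for short k; for long oligonucleotides such as 15-mers, one would presumably restrict the vector to k-mers actually observed.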
Unlike the well-established phylogenetic methods, BLSOM does not use sequence alignment. It is a new explainable AI method, so it can depict the mutations that drive the clustering, and consequently how the SARS-CoV-2 genome sequence changes over time. The scientists used BLSOM, histograms, and time-series analyses to try to identify aspects of the genomes’ evolutionary process, and the mutations essential for clade separation.
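To give a feel for alignment-free clustering on a self-organizing map, here is a minimal sketch of a generic batch-trained SOM in plain Python. It is a textbook-style simplification under assumed parameters (grid size, neighbourhood width, initialisation), not the BLSOM implementation from the paper, and the input vectors are made up.

```python
import math


def batch_som(data, rows, cols, epochs=20, sigma_start=1.5, sigma_end=0.3):
    """Train a small SOM with batch updates; return node weights and final winners."""
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Deterministic start: spread nodes between each feature's minimum and maximum.
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]
    steps = max(len(nodes) - 1, 1)
    weights = [[lo[d] + (hi[d] - lo[d]) * i / steps for d in range(dim)]
               for i in range(len(nodes))]

    def best_matching_unit(x):
        # Index of the node whose weight vector is closest to x.
        return min(range(len(nodes)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(x, weights[j])))

    for epoch in range(epochs):
        # Shrink the neighbourhood width geometrically over the epochs.
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        # Batch step: assign every input first, then update all nodes at once.
        winners = [best_matching_unit(x) for x in data]
        new_weights = []
        for j, (rj, cj) in enumerate(nodes):
            num, den = [0.0] * dim, 0.0
            for x, w in zip(data, winners):
                rw, cw = nodes[w]
                h = math.exp(-((rj - rw) ** 2 + (cj - cw) ** 2) / (2 * sigma ** 2))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            new_weights.append([n / den for n in num] if den > 0 else weights[j])
        weights = new_weights

    return weights, [best_matching_unit(x) for x in data]


# Made-up 2-D vectors standing in for oligonucleotide composition vectors.
toy_vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
weights, winners = batch_som(toy_vectors, rows=1, cols=2)
```

Since each epoch assigns all inputs before any node is updated, the result does not depend on the order in which the vectors are presented, which is the usual motivation for batch training.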
They performed BLSOM analysis on oligonucleotides of different lengths (1 through 7 bases long, and 15 bases long) to find the length that would separate the clades with the highest accuracy. BLSOM on oligonucleotides 15 bases long (15-mers) gave good separation by clade, and the resulting clusters corresponded primarily with the seven main clades and their internal divisions previously defined by the GISAID consortium.
To further analyse the changes in the viral sequence, the researchers plotted a logarithmic histogram of the monthly occurrence level of over 250,000 15-mers, comparing the levels in the June population of viruses with those in the December population. Interestingly, 60 types of 15-mers drastically increased or decreased in population frequency.
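The kind of comparison described here can be sketched as a log ratio of k-mer occurrence between an earlier and a later population. The example below is only an illustration: the function names, the pseudocount, the cutoff, and the toy sequences are all assumptions, and the study's exact procedure may differ.

```python
import math
from collections import Counter


def kmer_occurrence(genomes, k):
    """Total occurrences of each k-mer across a list of genome strings."""
    counts = Counter()
    for genome in genomes:
        for i in range(len(genome) - k + 1):
            counts[genome[i:i + k]] += 1
    return counts


def log10_fold_change(early, late, k, pseudocount=1.0):
    """log10 ratio of per-genome k-mer occurrence, late versus early population."""
    early_counts = kmer_occurrence(early, k)
    late_counts = kmer_occurrence(late, k)
    changes = {}
    for kmer in set(early_counts) | set(late_counts):
        # The pseudocount avoids division by zero for k-mers absent from one population.
        early_rate = (early_counts[kmer] + pseudocount) / len(early)
        late_rate = (late_counts[kmer] + pseudocount) / len(late)
        changes[kmer] = math.log10(late_rate / early_rate)
    return changes


# Toy populations invented for illustration; real input would be full genomes grouped by month.
early_population = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
late_population = ["ACGTTCGT", "ACGTTCGT", "ACGTTCGT"]
changes = log10_fold_change(early_population, late_population, k=4)

threshold = 0.3  # arbitrary cutoff (about a two-fold change), not taken from the study
shifted = sorted(m for m, v in changes.items() if abs(v) >= threshold)
```

In this toy case the 4-mers spanning the substituted base stand out, while 4-mers shared by both populations stay near zero, which mirrors how a small set of oligonucleotides can flag the mutations behind a shift in the population.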
Their analyses also identified nine of the ten known mutations assigned by the phylogenetic clustering method as mutations involved in main clade separation. They explained details of six of these mutations, including their amino-acid changes and the clade territories where the mutated sequences are located.
Use of BLSOM
This method can analyse over five million sequences together, and has powerful visualisation capabilities. It is also reportedly little affected by sequencing errors, so the sequences do not need any special pretreatment before analysis. It is therefore an effective way to complement the more traditional phylogenetic methods that use sequence alignment, especially when analysing massive numbers of sequences.