Alternative K-mer–Based Approaches Offer Faster Clostridioides difficile Genome Analysis, but Leave Room for Improvement

April 18, 2022

As more bacterial DNA sequences are identified, the computational challenges associated with whole-genome sequencing have researchers looking to other potential methods of tracking and understanding differences in bacterial pathogens such as Clostridioides difficile.

Whole-genome sequencing (WGS) has been pivotal in tracking and understanding bacterial pathogens such as Clostridioides difficile (C diff), but comparing a large variety of DNA sequences takes a massive amount of computational power that can prove a limiting factor in large-scale analyses.

A study published in Microbial Genomics tested how well the dimensionality reduction technique MinHash predicts single nucleotide polymorphism (SNP) distances between genomes and C diff ribotypes (RTs).

WGS facilitates outbreak investigations and infection control measures and surveillance, but as the bank of previously sequenced genomes grows, the process of identifying closely genetically related infections becomes more intensive from a computational standpoint. Most studies have relied on reconstructing phylogenies based on SNPs, which becomes less practical with larger data pools.

Tactics such as multilocus sequence typing (MLST) and core genome MLST (cgMLST) can more rapidly type isolates without the mapping or assembly involved in WGS, but MLST lacks the precision necessary to clarify whether sequences of the same sequence type (ST) constitute a likely outbreak. In addition, there are multiple cgMLST schemes for C diff that are not inter-operable with open-source software. Another non-WGS option for C diff is polymerase chain reaction (PCR) ribotyping, but this method is limited because identifying PCR ribotypes from short-read WGS data is challenging. Nor is there a perfect 1:1 correspondence between ribotypes and STs.

Given these drawbacks, the authors of the current study examined alternative approaches based on k-mers (fragments of sequence data of length k) for screening isolates to identify subsets with closely related genome sequences. Decomposing WGS reads or assemblies into k-mers and using MinHash makes it possible to approximate genomic distances quickly and without alignment.
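To make the idea concrete, the following is a minimal Python sketch of k-mer decomposition followed by a bottom-s MinHash estimate of Jaccard similarity. It is an illustration only and is not the sourmash implementation used in the study; the function names, the choice of the BLAKE2b hash, the tiny k of 5, and the sketch size of 16 are all assumptions made for this example.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash (assumed choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(seq, k=5, size=16):
    """Bottom-`size` sketch: keep only the `size` smallest k-mer hashes."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]

def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-`size` sketches.

    Take the `size` smallest hashes of the combined sketches and count
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def exact_jaccard(seq_a, seq_b, k=5):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    # Toy sequences, far shorter than a real genome.
    a, b = "ACGTACGTAC", "ACGTACGTTT"
    print(jaccard_estimate(minhash_sketch(a), minhash_sketch(b)))
    print(exact_jaccard(a, b))
```

The sketches stay a fixed size however long the input sequences are, which is where the computational saving comes from: comparing two genomes means comparing two short lists of hashes rather than aligning the sequences. Real tools also handle details omitted here, such as reverse-complement strands.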

“These approaches are potentially species agnostic, and could be deployed widely, without the need for species-specific schemes such as MLST, cgMLST or ribotyping,” study authors wrote. “They also have the advantage that they do not necessarily require prior genome assembly or alignment as required for cgMLST or SNP based analyses, respectively.”

The study assessed the precision of k-mer-based distances compared with SNP distances using a set of 1905 diverse C diff genomes whose pairwise differences ranged from 0 to 168,519 SNPs.

MinHash, as implemented in sourmash, was used to screen for closely related genomes. At a threshold giving 100% sensitivity for pairs with 10 or fewer SNPs, sourmash reduced the overall number of pairs from 1,813,560 to 161,934 (a 91% reduction) and had a positive predictive value of 32% for identifying pairs with 10 or fewer SNPs. The maximum SNP distance in this case was 4144. At 95% sensitivity, the number of pairs fell to 108,266 (a 94% reduction overall), the positive predictive value was 45%, and the maximum SNP distance was 1009.
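The pair counts and percentage reductions reported above can be checked with a few lines of Python, and the same snippet sketches how sensitivity and positive predictive value would be computed for a given k-mer distance threshold. The helper names and the toy pair data are hypothetical and are not taken from the study; only the genome count and the retained-pair counts come from the article.

```python
from math import comb

def reduction(retained_pairs, total_pairs):
    """Fraction of pairs excluded from further analysis by the screen."""
    return 1 - retained_pairs / total_pairs

def screen_metrics(pairs, max_kmer_dist, max_snps=10):
    """Evaluate a k-mer distance threshold against SNP distances.

    pairs: list of (kmer_distance, snp_distance) tuples, one per genome pair.
    Assumes at least one pair is flagged and at least one pair is truly close.
    Returns (sensitivity, positive predictive value, number of flagged pairs).
    """
    flagged = [p for p in pairs if p[0] <= max_kmer_dist]
    close = [p for p in pairs if p[1] <= max_snps]
    true_pos = [p for p in flagged if p[1] <= max_snps]
    return len(true_pos) / len(close), len(true_pos) / len(flagged), len(flagged)

# Figures from the article: 1905 genomes give 1905 * 1904 / 2 pairs.
total_pairs = comb(1905, 2)

if __name__ == "__main__":
    print(total_pairs)
    print(round(100 * reduction(161_934, total_pairs)))  # 100% sensitivity
    print(round(100 * reduction(108_266, total_pairs)))  # 95% sensitivity

    # Hypothetical toy data: (k-mer distance, SNP distance) per pair.
    toy = [(0.001, 2), (0.002, 8), (0.003, 150),
           (0.004, 9), (0.02, 4000), (0.05, 90000)]
    print(screen_metrics(toy, max_kmer_dist=0.004))
    print(screen_metrics(toy, max_kmer_dist=0.002))
```

In the toy data, loosening the threshold catches every close pair at the cost of admitting a distant one, while tightening it removes the false positive but misses a close pair. This is the same trade-off the study reports between its 100% and 95% sensitivity settings.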

“In a diverse dataset, clustering genomes by MinHash could rapidly exclude the majority of dissimilar genome pairs from further alignment and fine-scaled analysis,” the authors wrote. “The genomes clustered by MinHash were comprised of more similar pairs, comparable with genomes clustered by fractional typing schemes such as ribotyping.”

A rapid distance-based ribotype prediction method based on 3937 genomes with known ribotypes was also tested. A training set of 2937 genomes was used to construct a sourmash index, then the 1000 genomes in the test set were compared against it. The results were considered predictive if the closest 5 genomes in the index had the same ribotype as the searched genome. The MinHash ribotype index made correct predictions for 78% of the test genomes, incorrect predictions for 2%, and indeterminate predictions for 20%. Correct predictions rose to 87% when the classifier was relaxed to require 4 of the 5 closest matches to share the same ribotype.
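One way to read this classification rule is as a nearest-neighbor vote with an agreement threshold, sketched below in Python. This is an interpretation of the description above rather than the authors' code: the function names are hypothetical, the ribotype labels are illustrative, and the sketch assumes the 5 closest index genomes have already been retrieved and sorted by distance.

```python
from collections import Counter

def predict_ribotype(neighbour_ribotypes, min_agree=5, top_n=5):
    """Predict a ribotype from the closest `top_n` genomes in the index.

    neighbour_ribotypes: ribotypes of index genomes, nearest first.
    Returns the majority ribotype if at least `min_agree` of the closest
    `top_n` genomes share it, otherwise None (an indeterminate call).
    """
    top = neighbour_ribotypes[:top_n]
    if len(top) < top_n:
        return None
    ribotype, count = Counter(top).most_common(1)[0]
    return ribotype if count >= min_agree else None

def score(prediction, truth):
    """Label one prediction as correct, incorrect, or indeterminate."""
    if prediction is None:
        return "indeterminate"
    return "correct" if prediction == truth else "incorrect"

if __name__ == "__main__":
    # Illustrative neighbor lists, not data from the study.
    unanimous = ["027", "027", "027", "027", "027"]
    four_of_five = ["027", "027", "027", "027", "078"]

    print(score(predict_ribotype(unanimous), "027"))
    print(score(predict_ribotype(four_of_five), "027"))
    print(score(predict_ribotype(four_of_five, min_agree=4), "027"))
```

Lowering `min_agree` from 5 to 4 turns some indeterminate calls into predictions, which is consistent with the reported rise in correct predictions when the classifier was relaxed.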

While MinHash significantly reduces the computational power needed to compare C diff genomes, the k-mer–based approach performed modestly overall, the authors concluded. It did not introduce substantial error compared with using core genome SNP distances, and there is likely still room for improvement in this method.

“In a genomic surveillance context where hundreds or thousands of genomes for comparison are becoming routine, it does provide the opportunity to computationally inexpensively and rapidly subset genomes for alignment and outbreak detection while full, fine-scaled investigations are ongoing,” the authors wrote.

Reference

Moore MP, Wilcox MH, Walker AS, Eyre DW. K-mer based prediction of Clostridioides difficile relatedness and ribotypes. Microb Genom. Published online April 6, 2022. doi:10.1099/mgen.0.000804