The 12 most extreme cases, with only 0–4 HMM detections over 1051–1808 bp, were all identified as taxonomic misclassifications and represented eukaryotic 18S rather than bacterial or archaeal 16S sequences. This prevented detection by the domain-specific HMMs, although some HMMs that were designed for highly conserved regions were able to detect their targets across taxonomic domains. Among the 92 less extreme cases, with 6 to 9 HMM detections over 900–1504 bp, most sequences (i.e. 75 cases) contained a sequence segment at either the 5′ or
the 3′ end that did not match any entry in GenBank, as assessed through BLAST. We extracted these segments from 15 entries and subjected them to a separate BLAST analysis. In 11 cases, the segment alone showed no reasonable match to any entry in GenBank, indicating that the segment probably represents erroneous sequence information. In the other four cases, the segment matched entries other than the matches from the full BLAST search, indicating that the entire sequence is probably chimeric. Eight sequences were chimeric, which might have reduced the number of HMM detections per read length equivalent. It is noteworthy that most of these cases (76 out of 92) were flagged as being potentially chimeric in the SILVA database (average SILVA pintail score of 1.7%). In conclusion, the software showed extremely high detection reliability and flagged sequences
containing anomalies that can be detected by the algorithm, such as reverse complementary chimeras or non-16S sequence information.

Automated detection of the sequence
orientation might be particularly useful for environmental sequence data sets generated by high-throughput sequencing (HTS) techniques. However, the reduced read length might affect detection reliability, and speed could be a limiting factor in processing millions of reads in a reasonable time. To assess the performance of v-revcomp on HTS data, we extracted 332 835 and 13 876 V1-V2 subregions as well as 332 799 and 13 870 V1-V3 subregions from the bacterial and archaeal SILVA datasets, respectively, using v-xtractor 2.0 (Hartmann et al., 2010). These two datasets simulate sequence lengths approximately equivalent to lengths generated by the current HTS platforms (V1-V2, 261±18 bp) and lengths that will likely be reached by the next generation of HTS platforms (V1-V3, 481±22 bp). The bacterial V1-V2 and V1-V3 datasets were processed in 18 and 37 min, respectively, whereas both archaeal datasets took around 1 min. All sequences were given in the correct orientation, but five V1-V3 and four V1-V2 sequences were flagged as containing one reverse complementary HMM detection. These were cases already flagged in the full-length dataset. In conclusion, the tool also performed well for the short sequence reads characteristic of HTS datasets. The processing time increases linearly with the number of sequences, and the one million reads obtained from a full round of 454 pyrosequencing are processed in around one hour.
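
The one-hour figure can be checked against the bacterial benchmarks reported above. The following is a minimal sketch that assumes strictly linear scaling with read count, as stated in the text. The function name `estimate_minutes` and the dictionary layout are illustrative only and are not part of v-revcomp.

```python
def estimate_minutes(n_reads, ref_reads, ref_minutes):
    """Scale a measured runtime linearly to a different number of reads."""
    if ref_reads <= 0:
        raise ValueError("ref_reads must be positive")
    return n_reads * ref_minutes / ref_reads


# Bacterial benchmarks reported in the text: (number of reads, minutes).
BENCHMARKS = {
    "V1-V2": (332_835, 18),
    "V1-V3": (332_799, 37),
}

# Assumption: a full 454 run yields about one million reads.
TARGET_READS = 1_000_000

estimates = {
    region: estimate_minutes(TARGET_READS, reads, minutes)
    for region, (reads, minutes) in BENCHMARKS.items()
}

for region, minutes in estimates.items():
    print(f"{region}: about {minutes:.0f} min for {TARGET_READS:,} reads")
```

Under this assumption, the V1-V2 rate gives roughly 54 min for one million reads, in line with the estimate of around one hour. The same extrapolation for the longer V1-V3 reads gives a runtime of close to two hours.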