MXSCARNA
Experimental Results
Datasets
To test the empirical performance of MXSCARNA, we used three benchmark multiple alignment datasets: an original multiple alignment dataset, the BRAlibaseII multiple alignment dataset, and Kiryu et al.’s multiple alignment dataset.
Our original dataset comprised 1669 multiple alignments of 5 sequences each, obtained from the Rfam 7.0 database and restricted to sequences whose secondary structures have been published. The dataset contained 27 families of RNA sequences, and sequence identities varied from 35% to 100%. Sequences that included bases other than A, C, G, and U were removed because some of the alignment programs were unable to align them.
The BRAlibaseII benchmark dataset included 481 multiple alignments of 5 sequences each. The sequences of each multiple alignment were extracted from the tRNA, Intron gpII, 5S rRNA, and U5 families in the Rfam 5.0 database and from the signal recognition particle RNA family (SRP) in the SRPDB database. Because the dataset did not include consensus secondary structure annotations for the alignments, we used the secondary structure annotations recovered by Kiryu et al.
Kiryu et al.’s multiple alignment benchmark dataset was generated from selected seed alignments in the Rfam 7.0 database that have published consensus structures. For each sequence family, as many as 1000 random combinations of 10 sequences were generated. The alignments whose mean pairwise sequence identity exceeded 95% and whose gap characters accounted for more than 30% of the total number of characters aligned were removed. The resulting dataset consisted of 85 multiple alignments of 10 sequences each, generated from 17 sequence families, with five alignments per family. The dataset was reasonably divergent: mean sequence lengths ranged from 54 to 291 bases, and mean pairwise sequence identities ranged from 40% to 94%.
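The two statistics used to filter Kiryu et al.’s dataset can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset: the function names are hypothetical, pairwise identity is assumed to be computed over columns in which neither sequence has a gap, and the `require_both` flag is an assumption that makes explicit how the two thresholds are combined (the text joins them with "and").

```python
from itertools import combinations

GAP = "-"

def pairwise_identity(a: str, b: str) -> float:
    """Identity over columns where neither row has a gap (assumed definition)."""
    both = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)

def mean_pairwise_identity(rows: list[str]) -> float:
    """Average identity over all unordered pairs of aligned rows."""
    pairs = list(combinations(rows, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

def gap_fraction(rows: list[str]) -> float:
    """Share of gap characters among all characters in the alignment."""
    total = sum(len(r) for r in rows)
    return sum(r.count(GAP) for r in rows) / total

def should_remove(rows, max_identity=0.95, max_gaps=0.30, require_both=True):
    """Apply the two thresholds quoted in the text.

    require_both=True follows the literal wording ("... and ...");
    require_both=False drops an alignment when either threshold is exceeded.
    """
    too_similar = mean_pairwise_identity(rows) > max_identity
    too_gappy = gap_fraction(rows) > max_gaps
    return (too_similar and too_gappy) if require_both else (too_similar or too_gappy)
```

For example, `should_remove(["ACGU", "ACGA", "AGGU"])` keeps the alignment under either reading, because its mean pairwise identity is 2/3 and it contains no gaps.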
Evaluation measures
The quality of the alignments was evaluated by the Sum-of-Pairs Score (SPS) for alignment accuracy and by the Matthews Correlation Coefficient (MCC) for the accuracy of the secondary structure predictions. The SPS and MCC of the alignment to be evaluated (the test alignment) with respect to the reference alignment were defined as follows. The SPS was defined as the proportion of correctly aligned nucleotide pairs:
$$\mathrm{SPS} = \frac{\sum_{i=1}^{I} SP^t_i}{\sum_{j=1}^{J} SP^r_j}$$
where I is the number of columns in the test alignment, J is the number of columns in the reference alignment, SP^t_i is the total number of “correct” nucleotide pairs in column i of the test alignment (that is, pairs that also appear in the reference alignment), and SP^r_j is the total number of nucleotide pairs in column j of the reference alignment. The MCC was defined as
$$\mathrm{MCC} = \frac{TP \cdot TN - (FP - \xi) \cdot FN}{\sqrt{(TP + FP - \xi)(TP + FN)(TN + FP - \xi)(TN + FN)}}$$
where TP indicates the number of correctly predicted base pairs, TN the number of base pairs that were correctly predicted as unpaired, FP the number of incorrectly predicted base pairs, and FN the number of true base pairs that were not predicted. The term ξ accounts for predicted base pairs that were not present in the reference structure but were compatible with it. Compatible base pairs are not true positives, but they must be neither inconsistent (one or both nucleotides being part of a different base pair in the reference structure) nor pseudo-knotted with respect to the reference structure. To calculate the MCC for each test alignment, the reference alignment and the “correct” consensus secondary structure were taken from the database. To compare the accuracies of the alignments in terms of the implicitly predicted common secondary structures, the common secondary structure of each test alignment produced by the alignment programs was predicted with the Pfold program.
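The two measures can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code used in the study: alignments are assumed to be lists of equal-length strings with "-" as the gap character, rows are assumed to appear in the same order in both alignments, and the function names are hypothetical. The counts TP, TN, FP, FN, and ξ are taken as inputs rather than derived from structures.

```python
from itertools import combinations
from math import sqrt

def aligned_pairs(rows, gap="-"):
    """Set of residue pairs placed in the same column.

    Each residue is identified by (row index, ungapped position), so the
    result does not depend on where the gaps happen to be drawn.
    """
    pairs = set()
    pos = [0] * len(rows)
    for column in zip(*rows):
        present = []
        for s, ch in enumerate(column):
            if ch != gap:
                present.append((s, pos[s]))
                pos[s] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sps(test_rows, ref_rows):
    """Pairs shared by test and reference, divided by all reference pairs."""
    ref = aligned_pairs(ref_rows)
    if not ref:
        return 0.0
    return len(aligned_pairs(test_rows) & ref) / len(ref)

def mcc(tp, tn, fp, fn, xi=0):
    """MCC in which compatible false positives (xi) are not penalized."""
    fp_eff = fp - xi
    denom = sqrt((tp + fp_eff) * (tp + fn) * (tn + fp_eff) * (tn + fn))
    if denom == 0:
        return 0.0  # convention assumed here for the degenerate case
    return (tp * tn - fp_eff * fn) / denom
```

With the reference `["ACGU", "AC-U"]` and the test alignment `["ACGU", "A-CU"]`, two of the three reference pairs are recovered, so `sps` returns 2/3.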
Comparison of accuracies with those of other aligners
To compare the accuracies of the alignment methods we used a Linux machine with an AMD Opteron processor (2 GHz and 4 GB RAM). We compared the performance of MXSCARNA with that of Murlet, ProbCons, MAFFT, ClustalW, StrAl, MARNA, RNASampler, RNAlara, FoldalignM, Locarna, PMmulti, and Stemloc on the three datasets described earlier. Whereas ProbCons, MAFFT, and ClustalW align RNA sequences on the basis of sequence similarities only, StrAl, MARNA, RNASampler, RNAlara, FoldalignM, Locarna, PMmulti, Stemloc, and Murlet weigh both sequence similarities and secondary structures.
The results for each dataset are shown below. Because MARNA, Locarna, FoldalignM, PMmulti, and Stemloc impose high time and memory demands, those programs were executed only on families whose average sequence lengths were less than or equal to 100 bases. The SPS of MXSCARNA was comparable to those of Murlet and ProbCons, which are currently the best-performing aligners. In addition, the MCC of MXSCARNA was one of the highest among the aligners. In particular, the MCC of MXSCARNA was similar to that of Stemloc, which aligns only short sequences that have simple secondary structures.
Execution time
Comparisons of the alignment tools with regard to execution time for nucleotide sequences of various lengths are presented in the following figures.
Randomly generated sequences were allocated into groups of equal length and used for alignment. Stemloc aligned sequences of not more than 100 bases; FoldalignM and Locarna were faster than Stemloc and aligned sequences of 500 bases or less. Because the lengths of the sequences were the same in each alignment task, the banded Dynamic Programming (DP) technique of these methods was effective. Although the Murlet program returned results for sequences as long as 4000 bases in the best case, it was much slower than MXSCARNA. MXSCARNA required only 17 seconds to align 5 sequences of 500 bases and returned alignments for sequences as long as 5000 bases, though the accuracies for sequences longer than 500 bases have not yet been evaluated. Similar comparisons for various numbers of sequences are presented in the middle figure. The execution time of MXSCARNA was acceptable even for 50 sequences.
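The remark about banded DP can be illustrated with a generic sketch. The code below is not taken from any of the programs compared here; it is a plain sequence-only global alignment score with a linear gap penalty, and the function name, scoring values, and `band` parameter are assumptions. Only cells with |i - j| <= band are evaluated, so the work grows with n × band rather than n × m. When the two sequences have the same length, the alignment path starts and ends on the main diagonal, which is one reason a narrow band can suffice.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using only cells with |i - j| <= band."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {j: j * gap for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i * gap  # prefix of a aligned against gaps only
                continue
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                prev.get(j - 1, neg) + s,    # align a[i-1] with b[j-1]
                prev.get(j, neg) + gap,      # gap in b
                cur.get(j - 1, neg) + gap,   # gap in a
            )
        prev = cur
    return prev[m]
```

Cells outside the band are simply absent from the row dictionaries and are treated as minus infinity, so a band that is too narrow yields a suboptimal score rather than an error, except when the final cell itself is unreachable.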