To confirm our method, we have done some experiments. A data set used for training and test in our experiments is extracted from “Rfam” RNA family database (Griffiths-Jones03), which provides two alignment sets; a “SEED” set of high quality alignments which are obtained manually, and a “FULL” set of alignments generated by the covariance model (CM) method (Eddy94). We randomly extracted structurally aligned pairs of RNA sequences as the data set from 4 families in Rfam SEED because these families have enough number of alignments which are biologically plausible. The summary of the data set is shown in Table 2.
|RF00001||116.8||70.1||5S ribosomal RNA|
|RF00002||152.2||74.8||5.8S ribosomal RNA|
|RF00008||55.2||75.8||Hammerhead ribozyme (type III)|
The column of Len. means the average length of sequences, and the column of Homo. means the average homology of each pair of sequences calculated by the Smith-Waterman algorithm.
We have trained the parameters of the weight vector corresponding to the emission features and the state transition features for structural alignment according to CRF’s training method, in which 400 pairs in the data set are used as the training data. Then, we have aligned 100 pairs of RNA sequences in the rest of the data set with trained parameters; one of each pair is used as an unfolded RNA sequence, and aligned into the other using its annotations of secondary structures.
The result of them has been evaluated by specificity and sensitivity of predicted base pairs, the rate of correctly predicted base pairs 20 to all of predicted base pairs, and the rate of correctly predicted base pairs to all of the annotated base pairs in the data set, respectively.
Table 3 shows the accuracy of predicting base pairs by structural alignment for each family. Here, CRF is compared to the other scoring methods: EM by the expectation maximization (EM) algorithm, RIBOSUM by (Klein03), and FOLDALIGN by (Gorodkin97). The result clearly shows that CRF can outperform all the other existing methods for structural alignment of RNA sequences.
|SP (%)||SN (%)||SP (%)||SN (5)||SP (%)||SN (%)||SP (%)||SN (%)|
The columns of “SP” and “SN” indicate specificity and sensitivity of predicting base pairs, the rate of correctly predicted base pairs to all of predicted base pairs, and the rate of correctly predicted base pairs to all of the annotated base pairs in the data set, respectively.
CRF and the EM algorithm are similar to each other in the point that both methods compute the expectation of the parameters with dynamic programming such as the inside-outside algorithm and optimize them by maximizing the likelihood of training data. However, the EM algorithm is for generative models, that is, optimizes their parameters such that the model can generate the training data most likely, but has no guarantee to find the optimal alignments. On the other hand, CRF is for discriminative models, that is, optimizes their parameters such that the model can discriminate between correct alignments given by the training data and incorrect alignments most likely. Therefore, for the task of finding the optimal alignment of given sequences, CRF is more suitable than the EM algorithm.
RIBOSUM is the matrix of log-odds scores of the probability distribution on RNA mutations obtained by the maximum likelihood (ML) estimation against the background distribution on the RNA sequences, and has two meta-parameters: the first is the same as the BLOSUM matrices; the percentage of identical for grouping sequences, and the second is the lower bound of the homology between sequences which are used for calculating the matrix. Although these preprocessing of the data set could help for avoiding overfitting, RIBOSUM still requires a large number of high quality structure-annotated alignments. In addition, there are no efficient method to determine the optimal meta-parameters because of a great variety of candidates for the two RIBOSUM meta-parameters along with some alignment parameters such as gap penalties. On the other hand, CRF provides the efficient training algorithm that can determine the optimal parameter set, and has the generalization ability so that a satisfiable score matrix can be obtained even with a small number of sample data without overfitting. In our experiments, CRF only uses about 100 structural alignments of RNA sequences for each RNA family and succeeds to obtain better score matrix than the others.
To investigate the affect of training data on the performance of structural alignment, we have compared three types of training data for each family: all the families, only the same family as the test data, i.e., the test data is a “seen” family for the training data, and the other families excluding the test family, i.e., the test data is an “unseen” family for the training data. Every training data contains 400 pairs of sequences. The result in Table 4 indicates that if the test data is a “seen” family for the training data, there are no significant differences in the accuracy of predicting base pairs in comparison to using all the families, and even if the test data is an “unseen” family for the training data, our method yet retains high accuracy. This result suggests that our method could align RNA sequences of unknown families with high accuracy.
|all families||only the same family||the other families|
|SP (%)||SN (%)||SP (%)||SN (%)||SP (%)||SN (%)|
|RF00001||89.7||89.5||90.9 (+1.2)||91.4 (+1.9)||88.2 (-1.5)||87.9 (-1.6)|
|RF00002||84.7||82.5||87.3 (+2.6)||86.4 (+3.9)||78.7 (-6.0)||76.3 (-6.2)|
|RF00005||97.3||97.0||96.8 (-0.5)||96.8 (-0.2)||95.6 (-1.7)||95.7 (-1.3)|
|RF00008||91.5||92.2||90.1 (-1.4)||91.1 (-1.1)||86.6 (-4.9)||90.2 (-2.0)|
|total||90.4||89.7||91.3 (+0.9)||91.4 (+1.7)||87.3 (-3.1)||87.5 (-2.2)|
The columns of “all families”, “only the same family” and “the other families” indicate the family of the training data used for building the model. Each value in parentheses is difference from the column of “all families” on the same line.
We have done more practical experiments for predicting non-coding RNA regions by local alignment searches with a folded sequence of the target RNA on genome sequences. The regions of DNA subsequences significantly aligned into the folded RNA structure are predicted as possible candidates of RNA coding regions of the same RNA family. For this experiment, we first prepare some longer DNA sequences including the true RNA coding sequences by appending dozens of nucleotides randomly with the same composition to the upstream and downstream of the RNA sequences. Second, we predict RNA regions on the prepared DNA sequences by aligning into the same family of a folded RNA structure. Then, we evaluate the correctness of the predicted RNA regions by counting errors which are the number of nucleotides not included in the region.
Table 5 shows the average of prediction errors for each family. The column of PHMM indicates the prediction errors by the usual (unstructured) pair hidden Markov models. The result indicates that PHMM completely fails to predict RNA regions because of lack of ability to deal with any secondary structures, and structural alignment based on CRFs is more accurate for predicting non-coding RNA regions than the other scoring methods such as EM, RIBOSUM and FOLDALIGN. These experimental results strongly support our discriminative method employing CRFs to estimate the score matrix parameters.