PSTAG

Experimental Results

To validate our method, we performed experiments, each using a single RNA family from a database. We first randomly chose an RNA sequence annotated with a known pseudoknot structure and parsed it into a skeletal tree, then aligned all the other ‘unfolded’ RNA sequences in the family to the selected ‘folded’ skeletal tree without using their annotations. We evaluated the results by specificity and sensitivity, that is, the ratio of correctly predicted base pairs to all predicted base pairs, and the ratio of correctly predicted base pairs to all trusted base pairs in the database, respectively. Further, to remove the dependency of the prediction results on the selected folded RNA sequence, we performed cross-validation and calculated the average over all cases.
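The two accuracy measures defined above can be illustrated with a minimal Python sketch. It assumes that a structure is represented as a set of (i, j) position pairs; the function names `base_pair_accuracy` and `summarize`, the toy data, and the use of the sample standard deviation are illustrative assumptions and are not taken from the PSTAG implementation.

```python
from statistics import mean, stdev


def base_pair_accuracy(predicted, trusted):
    """Return (specificity, sensitivity) in percent.

    specificity = correct pairs / all predicted pairs
    sensitivity = correct pairs / all trusted pairs
    Both arguments are iterables of (i, j) position pairs.
    """
    predicted, trusted = set(predicted), set(trusted)
    correct = len(predicted & trusted)
    specificity = 100.0 * correct / len(predicted) if predicted else 0.0
    sensitivity = 100.0 * correct / len(trusted) if trusted else 0.0
    return specificity, sensitivity


def summarize(values):
    """Mean and sample standard deviation over cross-validation runs.

    Assumption: the sample (n - 1) standard deviation is used; the
    source does not say which estimator was applied.
    """
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


# Hypothetical toy data, not taken from Rfam or PseudoBase.
trusted = {(1, 20), (2, 19), (3, 18), (8, 30)}
predicted = {(1, 20), (2, 19), (3, 18), (9, 29), (10, 28)}

spec, sens = base_pair_accuracy(predicted, trusted)
print(f"{spec:.1f} {sens:.1f}")  # prints "60.0 75.0"
```

In a cross-validation setting, `base_pair_accuracy` would be called once per choice of folded sequence, and `summarize` applied to the resulting lists of specificity and sensitivity values.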

The datasets used in our experiments were taken from the RNA families database ‘Rfam’ at the Sanger Institute (Griffiths-Jones et al., 2003) and a collection of RNA pseudoknots, ‘PseudoBase’, at Leiden University (van Batenburg et al.). RNA sequences in Rfam are aligned and annotated with secondary structures by the covariance model (CM) method (Eddy and Durbin, 1994). Among the 176 RNA families in Rfam (version 5.0), 7 have pseudoknot annotations. These annotations are unreliable because CM models RNA sequences with profile SCFGs, which cannot deal with pseudoknots. In contrast, the annotations of pseudoknot RNA structures in PseudoBase are biologically reliable.

First, we evaluated the accuracy of predicting base pairs by the PSTAG algorithm for three RNA families, Corona_pk3, HDV_ribozyme and Tombus_3_IV, which have pseudoknot annotations in Rfam. Corona_pk3 and HDV_ribozyme constitute simple pseudoknot structures which can be analyzed by an SLTAG, whereas Tombus_3_IV has one branching secondary structure involving a pseudoknot which requires an ESL-TAG. The results in Table 2 show that PSTAG can predict accurate structural alignments for all three RNA families.

Table 2:
The result of predicting base pairs including pseudoknots by PSTAG for three RNA families in Rfam.
RNA family      len. of seqs.   # of seqs.   specificity (%)   sensitivity (%)   time (sec)    memory (MB)
Corona_pk3      62.9            14           95.5 ± 5.0        94.6 ± 5.0        25.8 ± 1.2    (4.33 ± 0.17) * 10^2
HDV_ribozyme    89.1            15           95.6 ± 5.1        94.1 ± 5.6        177 ± 10      (2.23 ± 0.12) * 10^3
Tombus_3_IV     91.2            18           97.4 ± 6.0        97.4 ± 6.0        214 ± 17      (2.70 ± 0.10) * 10^3

Each value in the specificity, sensitivity, time and memory columns is the average and standard deviation taken over the sequences in the family. CPU time and memory usage were measured on a machine with an Intel Pentium 4 2.80 GHz processor and 4 GB of RAM.

Second, we compared the prediction accuracy of the PSTAG algorithm with that of PHMMTS and of the standard alignment software ‘Clustal-W’ (Thompson et al., 1994) on an HDV_ribozyme RNA in PseudoBase, which has reliable annotations of pseudoknot structures. In this experiment, PHMMTS ignores the annotations of some stacked base pairs with crossing dependencies because it lacks the generative power for pseudoknots. Similarly, Clustal-W ignores all structural annotations because it lacks the generative power for secondary structures. The rows of Table 3 show the accuracy of predicting base pairs by PSTAG, PHMMTS and Clustal-W, respectively. Figure 8 shows a detailed comparison among the three methods, in which the structures correctly predicted by each method are indicated by the mark ‘ ’. PSTAG succeeded in predicting both ‘( )’ and ‘[ ]’ base pairs, PHMMTS predicted only ‘( )’ base pairs, and Clustal-W predicted only a few of the annotated base pairs. These results indicate that the more grammatically powerful the method, the more accurate the predictions. However, a more grammatically powerful method generally consumes more CPU time and memory.
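The distinction between ‘( )’ and ‘[ ]’ base pairs can be made concrete with a short Python sketch. It assumes the common bracket notation in which each bracket type is nested on its own and pairs of different types may cross. The functions `parse_pairs` and `has_crossing` and the example string are hypothetical illustrations, not part of PSTAG, PHMMTS or Clustal-W.

```python
def parse_pairs(structure, brackets=(("(", ")"), ("[", "]"))):
    """Convert a bracket annotation into a sorted list of 0-based (i, j) pairs.

    Each bracket type gets its own stack, so '(' ')' and '[' ']' pairs
    may interleave, which is how pseudoknots are usually annotated.
    """
    openers = {o: k for k, (o, _) in enumerate(brackets)}
    closers = {c: k for k, (_, c) in enumerate(brackets)}
    stacks = [[] for _ in brackets]
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in openers:
            stacks[openers[ch]].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.append((stack.pop(), pos))
    if any(stacks):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def has_crossing(pairs):
    """True if two pairs (i, j) and (k, l) satisfy i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


# Hypothetical toy annotation: a stem '(( ))' crossed by a stem '[[ ]]'.
structure = "((..[[..))..]]"
pairs = parse_pairs(structure)
print(pairs)                # [(0, 9), (1, 8), (4, 13), (5, 12)]
print(has_crossing(pairs))  # True
```

A method restricted to nested structures can represent only annotations for which `has_crossing` returns False, which is why the crossing ‘[ ]’ pairs are the ones such a method has to ignore.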

Table 3:
The accuracy of predicting base pairs for HDV_ribozyme (PKB76) in PseudoBase by PSTAG, PHMMTS, and Clustal-W.
Method       specificity (%)   sensitivity (%)
PSTAG        88.9              96.0
PHMMTS       46.4              52.0
Clustal-W    25.9              28.0

Figure 8: A detailed comparison of secondary structure prediction for HDV_ribozyme (PKB76) in PseudoBase among three methods: PSTAG, PHMMTS and Clustal-W. Structures correctly predicted by each method are indicated with the mark ‘ ’.

pstag-fig8

In the third experiment, we used PSTAG to structurally re-align all the RNA sequences of the HDV_ribozyme family in Rfam, whose pseudoknot annotations are unreliable, to the reliable pseudoknot structure of HDV_ribozyme in PseudoBase. As a result, PSTAG corrected about 25% of the base pairs annotated in Rfam for HDV_ribozyme, which were undesirable in comparison with PseudoBase. For example, there are some significant differences between the annotation in Rfam and the prediction by PSTAG, as shown in Figure 9, where the undesirable base pairs annotated in Rfam are indicated with the mark ‘^’. For these base pairs, PSTAG can therefore predict more stable secondary structures than the annotations in Rfam. In addition, the predictions by PSTAG suggest a new structure, which constitutes an additional internal loop at the 3′ end of HDV_ribozyme, as indicated with the mark ‘ ’ in Figure 9 and with an arrow in Figure 10.

Figure 9: PSTAG improved about 25% of the base pairs in the annotations of HDV_ribozyme in Rfam (RF00094 in the example above) by using the annotation of HDV_ribozyme (PKB76) in PseudoBase. Undesirable base pairs annotated in Rfam are indicated with the mark ‘^’, whereas an additional internal loop suggested by PSTAG is indicated with the mark ‘ ’.

pstag-fig9

Figure 10: A new structure of HDV_ribozyme (RF00094) suggested by PSTAG. The additional internal loop is indicated with an arrow.

pstag-fig10

No other structural alignment approach exists for aligning and predicting pseudoknot RNA structures. Among non-comparative approaches, there are several theoretical or heuristic works that predict pseudoknot RNA structures for a single RNA sequence by maximizing stacking base pairs or by free energy minimization (Abrahams et al., 1990; Cary and Stormo, 1995; Gultyaev et al., 1995; van Batenburg et al., 1995; Rivas and Eddy, 1999; Lyngsø and Pedersen, 2000; Ieong et al., 2003; Ruan et al., 2004). Ruan et al. (2004) recently proposed a simple but effective heuristic method for predicting pseudoknot structures, called iterated loop matching (ILM), and showed that it performs well compared with other existing methods. Although their approach is completely different from ours, we compared PSTAG with ILM to confirm the effectiveness of our approach. Table 4 compares ILM and PSTAG in predicting pseudoknot structures for HDV_ribozyme and for the tobamovirus TMV. In this experiment, PSTAG aligned unfolded RNA sequences of HDV_ribozyme in Rfam to a folded RNA sequence of HDV_ribozyme with structural annotations in PseudoBase and, similarly, aligned unfolded sequences of TMV in Rfam to a folded RNA sequence of the ‘sunn-hemp mosaic virus’ CcTMV, whose structure was determined by van Belkum et al. (1985). Note that the sequence homology between the sequences of HDV_ribozyme in Rfam and the selected sequence of HDV_ribozyme in PseudoBase is 65.1% on average, and that between TMV in Rfam and CcTMV is only 26.0%. The results show that PSTAG performs comparably to ILM in the prediction accuracy of pseudoknot structures, and further suggest that structural alignment by PSTAG does not require high sequence homology between an unfolded sequence and a folded sequence.
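The source does not state how the sequence homology figures were computed. One common way to obtain such a number is the percent identity over the columns of a pairwise alignment, sketched below in Python. The function `percent_identity`, the choice to skip columns where both sequences have a gap, and the toy sequences are assumptions made for illustration only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same base.

    Assumption: columns where both sequences have a gap are skipped, and
    a base aligned against a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Hypothetical toy alignment, not taken from Rfam or PseudoBase.
print(percent_identity("GGCA-UCC", "GGUAAUC-"))  # 62.5
```

Averaging this value over all unfolded sequences aligned against the chosen folded sequence would give a per-family figure of the kind quoted above, under the stated assumptions.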

Table 4:
Comparisons of prediction accuracies between PSTAG and ILM.
           HDV_ribozyme                  TMV
(%)        specificity   sensitivity     specificity   sensitivity
PSTAG      88.9          96.0            92.0          92.0
ILM        100.0         82.4            80.0          80.0