Rfold

Experimental Results

Here, we summarize the performance of Rfold presented in the original paper. For more details, see the original reference.

Dataset

We extracted 151 alignments of structural RNA families from the seed alignments in the Rfam 7.0 database. All of these alignments had annotated secondary structures that had been published in the literature. From each family, we then selected the single representative sequence with the maximal number of canonical base pairs. From these sequences, we created four types of datasets (Datasets1–4).

Dataset1 comprises these 151 RNA sequences.
Dataset2 is created from Dataset1 by appending random sequences of length e = 100, 300, 500, and 1000 to both ends of each sequence.
Dataset3 contains a single sequence of length 172k bases, which is obtained by alternately concatenating the sequences of Dataset1 and random sequences of length 1000.
Dataset4 comprises 10 random sequences of length 10k bases, and it is used as the control set to estimate the false positive rate of the structure predictions.
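
The construction of Dataset2 and Dataset3 can be sketched as follows. This is a minimal Python sketch, not the authors' code: the helper names (random_segment, make_dataset2_entry, make_dataset3) are hypothetical, and cutting each random segment as a window from a pre-shuffled pool is an assumption, since the text does not say how the individual random sequences were drawn. The order of alternation in Dataset3 (family sequence first, then spacer) is likewise assumed.

```python
import random


def random_segment(pool: str, length: int, rng: random.Random) -> str:
    """Cut a random window of `length` bases from a pre-shuffled pool.

    Assumption: the source does not describe how segments were drawn.
    """
    start = rng.randrange(len(pool) - length + 1)
    return pool[start:start + length]


def make_dataset2_entry(seq: str, pool: str, e: int, rng: random.Random) -> str:
    """Append a random sequence of length e to both ends of seq."""
    left = random_segment(pool, e, rng)
    right = random_segment(pool, e, rng)
    return left + seq + right


def make_dataset3(seqs, pool: str, rng: random.Random, spacer: int = 1000) -> str:
    """Alternate family sequences with random spacers of fixed length."""
    parts = []
    for seq in seqs:
        parts.append(seq)
        parts.append(random_segment(pool, spacer, rng))
    return "".join(parts)
```

With 151 sequences and spacers of length 1000, the spacers alone contribute 151k bases, which is consistent with the stated total of about 172k bases.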

The random sequences in these datasets were generated by concatenating the 151 RNA sequences and shuffling the nucleotides of the concatenated sequence while conserving the dinucleotide frequencies.
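
A shuffle that conserves dinucleotide frequencies can be sketched as below. The source does not state which algorithm was used; this sketch assumes the Altschul-Erikson approach, which treats each dinucleotide as an edge of a multigraph and samples a random Eulerian path from the first to the last nucleotide. The function names dinucleotide_counts and dinucleotide_shuffle are hypothetical.

```python
import random
from collections import Counter, defaultdict


def dinucleotide_counts(seq: str) -> Counter:
    """Count every adjacent pair of nucleotides in seq."""
    return Counter(zip(seq, seq[1:]))


def dinucleotide_shuffle(seq: str, rng: random.Random) -> str:
    """Shuffle seq while keeping all dinucleotide counts unchanged."""
    if len(seq) < 3:
        return seq
    last = seq[-1]
    # edges[a] lists every nucleotide that follows a in the input.
    edges = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        edges[a].append(b)

    def reaches_last(v, last_edge):
        # Follow the chosen final exits; they must lead to the last nucleotide.
        seen = set()
        while v != last and v not in seen:
            seen.add(v)
            v = last_edge[v]
        return v == last

    # Pick a final exit edge for every vertex except the last one, and retry
    # until these exits form a tree rooted at the last nucleotide.
    while True:
        last_edge = {v: rng.choice(s) for v, s in edges.items() if v != last}
        if all(reaches_last(v, last_edge) for v in last_edge):
            break

    # Shuffle the remaining exits of each vertex; the final exit goes last.
    order = {}
    for v, successors in edges.items():
        rest = list(successors)
        if v != last:
            rest.remove(last_edge[v])
            rng.shuffle(rest)
            rest.append(last_edge[v])
        else:
            rng.shuffle(rest)
        order[v] = rest

    # Walk the graph from the first nucleotide, consuming exits in order.
    out = [seq[0]]
    used = Counter()
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(order[cur][used[cur]])
        used[cur] += 1
    return "".join(out)
```

Comparing dinucleotide_counts before and after the shuffle is a quick way to confirm that the frequencies are conserved.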

Accuracy Measures

To estimate the accuracy of the base pairing probabilities (BPPs) p(i, j) and the structure predictions, we draw receiver operating characteristic (ROC) curves that represent the balance between the sensitivity to the true base pairs and the rate of false positives in the non-structured sequences.

In the case of BPP comparison, the sensitivity is defined as the fraction of the true base pairs that have a BPP larger than a given threshold value p0. The false positive rate is defined as the number of base pairs (i, j) with p(i, j) > p0 in the non-structured sequence, divided by the length of the sequence. We draw the ROC curve by examining several values of p0.

In the case of structure prediction, the sensitivity is defined as the fraction of true base pairs that are correctly predicted by the programs. We define the false positive rate as the fraction of the non-structured sequence that lies in inner regions (i.e. the segments enclosed by any base pair). This definition penalizes long inner regions that contain only a small number of predicted base pairs. Furthermore, only the base pairs that satisfy the maximal span constraint are counted as true base pairs, in order to remove the effect of the trivial loss of sensitivity to the distant base pairs with |i - j| > W.
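
The measures described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the evaluation code of the paper: the function names are hypothetical, BPPs are assumed to be stored in a dictionary keyed by 0-based position pairs (i, j), and the two paired positions themselves are assumed to count as part of an inner region.

```python
def bpp_sensitivity(bpp, true_pairs, p0, max_span):
    """Fraction of true pairs within the maximal span whose BPP exceeds p0."""
    eligible = [(i, j) for (i, j) in true_pairs if abs(i - j) <= max_span]
    if not eligible:
        return 0.0
    hits = sum(1 for pair in eligible if bpp.get(pair, 0.0) > p0)
    return hits / len(eligible)


def bpp_false_positive_rate(bpp, p0, length):
    """Pairs with BPP above p0 in a random sequence, per base of sequence."""
    return sum(1 for p in bpp.values() if p > p0) / length


def inner_region_fraction(pairs, length):
    """Fraction of positions enclosed by at least one predicted base pair.

    Assumption: the paired positions i and j are included in the region.
    """
    covered = [False] * length
    for i, j in pairs:
        for k in range(min(i, j), max(i, j) + 1):
            covered[k] = True
    return sum(covered) / length


def roc_points(bpp_true, true_pairs, bpp_random, random_length, max_span, thresholds):
    """One (false positive rate, sensitivity) point per threshold p0."""
    return [
        (
            bpp_false_positive_rate(bpp_random, p0, random_length),
            bpp_sensitivity(bpp_true, true_pairs, p0, max_span),
        )
        for p0 in thresholds
    ]
```

Under the inner-region definition, a single predicted pair spanning a long stretch of random sequence contributes the whole enclosed segment to the false positive rate, which is why long, sparsely paired regions are penalized.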

Comparison of the quality of the computed local base pairing probabilities

[Figure: rfold1]

This figure shows the ROC curves of the computed base pairing probabilities. False positive rates are calculated for Dataset4 (random sequences), and the sensitivities are calculated for Dataset1 (151 Rfam seed sequences) (circles) and Dataset2 with e = 1000 (i.e. random sequences of length 1000 are added to both ends of each Rfam sequence) (triangles). The open and filled symbols represent the ROC curves of Rfold and RNAplfold, respectively. In both figures, the maximal span W is set to 800.

Comparison of the accuracy of local structure predictions

[Figure: rfold2]

Comparison of the accuracies of the predicted structures. We used Dataset4 for the computation of the false positive rate and Dataset3 for the sensitivity.

We examined three values of W: 50 (circle), 100 (triangle), and 200 (square).
Since RNALfold (denoted by “Lfold” in the figure) has no parameter that controls the balance between the sensitivity and the false positive rate, only one point is plotted for each value of the maximal span W.

Comparison of Running Time

Task                  Program    Running time
BPP Computation       Rfold      15 min
BPP Computation       RNAplfold  12 min
Structure Prediction  Rfold      22 min
Structure Prediction  RNALfold   30 sec

Comparison of the running times of Rfold, RNAplfold, and RNALfold. We used Dataset3, which consists of a single sequence with a length of 172k bases. The maximal span W is set to 100.