RNAmine

RNAmine is a software tool to extract the structural motifs from a set of RNA sequences using a graph mining technique. Developed by Mizuho Information & Research Institute.

Results

For benchmarking, we used the Rfam database (version 7.0, March 2005) containing 503 RNA families whose common secondary structures are available for the seed sequences. We selected eight families whose number of hairpins is more than three. All the families except tRNA are used by CMfinder as well. For the families that contain more than 50 seed sequences, we randomly selected 50 sequences. We did not include the sequences with a nucleotide character other than A, C, U, G and T. The dataset is summarized in Table 1. All experiments are performed using a machine with a 2.4GHz AMD Opteron (TM) processor and 20GB memory.

Secondary Structure Prediction

Secondary structure prediction of an individual sequence can be done by free energy minimization using, e.g., RNAfold. However, when the sequences share a common secondary structure, it is often better to find the common structure and parse each sequence using the common structure. This process can be implemented using the EM algorithm over the covariance model, CMfinder and a graph-theoretical method, comRNA. When a multiple alignment of the sequences is given a priori, the secondary structure can be predicted by, e.g., RNAalifold Another approach to predict the secondary structure is to derive a number of suboptimal secondary structures for a sequence.

In this experiment, RNAmine is compared with CMfinder RNAfold RNAsubopt RNAalifold and comRNA CMfinder and comRNA exploit the common secondary structure, while RNAfold and RNAsubopt predicts the structure individually. Like RNAmine, RNAsubopt derives a number of multiple possible structures for a sequence. RNAalifold assumes the multiple alignment, which is made here by clustalW Those tools are used mostly with the default parameters (See the supplementary paper for details). For RNAmine we set minsup and maxcost to be 0.7 and 0.6, and w_1, w_2, w_3 and w_4 to be 0.6, 0.15, 0.10, 0.15, respectively.

We used the MCC (Mathews Correlation Coefficient), defined in section 3 in the supplementary paper, as the performance measure. The average MCCs are summarized in Table 1 and the running times are shown in Table 2. See sensitivity and PPV for each family in the supplementary paper (Table S1 and S2). The number of predicted structures per sequence is also shown for RNAsubopt. RNAmine performed better than RNAfold, RNAsubopt and comRNA in most cases. The accuracies of RNAfold are not better than those of the other methods, showing the difficulty of predictions from individual sequences. The results of RNAalifold in Table 1 have limited accuracies because the alignments of clustalW were used. When the reference alignments that had been annotated in Rfam database were used, the results were much better (See Table S3 in supplementary paper). RNAsubopt performed relatively well, but it is due to the large number of predictions (about 120 in maximum and 44 in average). In comparison to CMfinder, RNAmine achieved comparative accuracy overall, and for several families such as Lysine and RFN, RNAmine performed better. The homogeneity of this dataset was relatively high, because the sequence set is derived from one family. This result shows that RNAmine can compete well with the state-of-the-art methods even in those clean datasets. In addition, RNAmine can deal with non-homogeneous data as shown in the next section.

It is remarkable that RNAmine was more than twenty times faster than CMfinder, though the worst-case time complexity of graph mining is theoretically NP-hard Graph mining is fast when the size of search tree is kept small as in our implementation.

The computational time decreases monotonically as the minimum support increases, because the search tree can be pruned earlier if the minimum support is high. It is interesting to see that the best accuracy is achieved at 0.7, which is much better than the accuracy at 1. Setting the minimum support to 1, a stem pattern matching all the sequences is obtained. However, when the minimum support is below one, multiple stem patterns are obtained, each of which matches to a subset of sequences. This result shows the structure prediction can be enhanced by exploiting hidden clusters in the family.

**Table 1:**
Summary of the test data and the results
Family	RFAM_ID	#seqs	length	%id	RNAmine			CMfinder	comRNA	RNAalifold	RNAfold	RNAsubopt
Family	RFAM_ID	#seqs	length	%id	1	5	10	MCC	MCC	MCC	MCC	MCC	str/seq
Cobalamin	RF00174	50	203.2	43	0.41	0.52	0.53	0.54	0.00	0.47	0.34	0.44	119.6
Lysine	RF00168	50	181.6	46	0.80	0.85	0.86	0.79	0.21	0.35	0.64	0.74	112.3
Purine	RF00167	37	99.6	53	0.83	0.90	0.91	0.89	0.00	0.52	0.73	0.81	8.3
RFN	RF00050	48	137.2	64	0.62	0.71	0.74	0.41	0.00	0.57	0.44	0.52	29.4
S_box	RF00162	50	110.4	61	0.77	0.82	0.84	0.78	0.29	0.48	0.64	0.76	35.1
Tymo_tRNA-like	RF00233	27	82.6	66	0.76	0.88	0.88	0.93	0.55	0.51	0.60	0.72	10.3
glmS	RF00234	14	177.6	55	0.80	0.86	0.90	0.88	0.47	0.35	0.58	0.66	30.7
tRNA	RF00005	50	73.4	40	0.75	0.84	0.84	0.78	0.00	0.37	0.60	0.73	8.5
average					0.72	0.80	0.81	0.75	0.19	0.45	0.57	0.67	44.3

RFAM_ID: ID number in Rfam database (http://www.sanger.ac.uk/Software/Rfam/).
#seq: the number of sequences in each family.
length: average length of sequence in each family.
%id: average sequence identity calculaed by alistat program.
MCC: average MCC among sequences.
Best MCC among top 1, 5 and 10 structures are shown in result of RNAmine. For comRNA and RNAsubopt, the best MCC among predicted common secondary structures is shown (if comRNA produced no motif, MCC is 0 in this table).
str/seq (for RNAsubopt): the average number of predicted suboptimal secondary structures per sequence.

The definition of MCC is found in the supplementary paper.

**Table 2:**
Running time (seconds) for each family in Table 1
Family	RNAmine	CMfinder	comRNA	RNAsubopt
Cobalamin	46.7	1684.9	1072.1	20.3
Lysine	115.7	1397.5	944.5	19.3
Purine	4.0	198.9	1216.1	0.6
RFN	35.5	630.8	648.7	2.8
S_box	8.0	352.6	627.2	2.8
Tymo_tRNA-like	1.8	106.0	816.7	0.3
glmS	26.9	355.3	732.5	1.8
tRNA	2.5	230.5	1220.4	0.6
average	30.1	619.6	909.8	6.0

The results of RNAalifold and RNAfold are omitted. The running time of both tools are within a few seconds for all the families.

Dataset with Multiple Families

In this experiment, our method is applied to the input sequences including multiple RNA families. Six datasets are generated by combining two families in Table 1 into one. We compared proposed method with only RNAfold and RNAsubopt, because the other tools assume that input sequences are related sequences or cannot handle multiple families. For RNAmine, minsup is set to be 0.3 and the other parameter settings are the same as the previous experiments in Table 1. Table 3 shows our results. In comparison to RNAfold and RNAsubopt, RNAmine has achieved better accuracies uniformly in all the datasets. Moreover, RNAmine indeed detected the two families as the separate stem patterns in most cases (see the supplementary paper).

**Table 3:**
Results for multiple family dataset
	RNAmine	RNAfold	RNAsubopt
Family	MCC	MCC	MCC	#secs/seq
Cobalamin + Lysine	0.62	0.49	0.59	116
Lysine + RFN	0.79	0.54	0.63	70.9
Purine + Tymo_tRNA-like	0.91	0.67	0.76	9.3
S_box + Purine	0.86	0.69	0.78	21.7
tRNA + S_box	0.79	0.62	0.75	21.8
tRNA + Tymo_tRNA-like	0.82	0.60	0.73	9.4
average	0.80	0.60	0.71	41.5

Each dataset is created by combining two families in Table 1 into one dataset.
MCC: average MCC.
For RNAsubopt, the best MCC among predicted suboptimal secondary structures is shown. For RNAmine, the best MCC among top 5 structures is shown.
#secs/seq (for RNAsubopt): the average number of predicted secondary structures per sequence.

mhamada
2008-03-21