Standalone Pipeline system of CentroidHomfold-LAST
This pipeline predicts the secondary structure of a target RNA sequence by using automatically collected homologous sequences. It is the pipeline used in the CentroidHomfold-LAST Web application (http://www.ncrna.org/centroidhomfold/). With the standalone pipeline, users can search for candidate homologous sequences in a database built from their own RNA sequences, which the Web application cannot do because it employs only prepared databases.
The method used in this pipeline combines two previously published software packages from our group: CentroidHomfold (Hamada et al., Bioinformatics 25(12): i330-i338, 2009) and LAST (Frith et al., BMC Bioinformatics 11:80, 2010; Kielbasa et al., Genome Res. 21: 487-493, 2011).
This pipeline works on a standard Linux machine. We tested it on a machine running SuSE Linux.
Installation & test
First, download the pipeline package (about 16 MB, because it contains a database of RNA sequences) and unpack it.
% tar zxvf centroidhomfold-last-v1.0-pipeline.tar.gz
% cd centroidhomfold-last-v1.0-pipeline
% ls
centroid_homfold.pl db external manual.html param scripts test test_pipeline.pl
A Perl script "centroid_homfold.pl" in the top directory is the executable file of the pipeline. The "manual.html" is the same as this page.
The directories "external" and "scripts" contain several binaries and scripts used in this pipeline. The "db" directory includes a pre-compiled database. The directory "param" contains the parameter file of the BL model. The "test" directory contains a small amount of test data used by "test_pipeline.pl".
Set the environment variable "CENTROID_HOM_LAST" to the top directory of the pipeline.
% export CENTROID_HOM_LAST=/Top/Dir/To/Pipeline
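Before running anything, it can help to confirm that the variable points at a directory that looks like the unpacked package. The following is a minimal sketch and is not part of the pipeline: the function name "check_pipeline_dir" is hypothetical, and the entries it checks are simply those shown in the directory listing above.

```shell
# Hypothetical helper (not shipped with the pipeline): verify that a directory
# contains the top-level entries of the unpacked package.
check_pipeline_dir() {
    dir=$1
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "not a directory: '$dir'" >&2
        return 1
    fi
    status=0
    # The main script must be a regular file.
    if [ ! -f "$dir/centroid_homfold.pl" ]; then
        echo "missing: $dir/centroid_homfold.pl" >&2
        status=1
    fi
    # The remaining entries must be directories.
    for sub in db external param scripts; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            status=1
        fi
    done
    return $status
}

# Usage: check_pipeline_dir "$CENTROID_HOM_LAST" && echo "CENTROID_HOM_LAST looks fine"
```

The check only looks at names, so it cannot tell whether the contents of those directories are complete.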
If the binaries in the external directory do not work, you must install the following software in the "external" directory.
- "centroid_fold" and "centroid_homfold" binaries: Download the centroid_fold package from http://www.ncrna.org/software/centroidfold/download/ and compile it. Then copy "centroid_fold" and "centroid_homfold" to the external directory (or create symbolic links to them there).
- "lastdb", "lastex" and "lastal" binaries: Download the LAST package from http://last.cbrc.jp/ and compile it. Then copy "lastal", "lastdb" and "lastex" to the external directory (or create symbolic links to them there).
- "RNAplot" binary: Download the Vienna RNA package from http://www.tbi.univie.ac.at/~ivo/RNA/ and compile it. Then copy "RNAplot" to the external directory (or create a symbolic link to it there).
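To see which of the binaries listed above still need to be installed, a small check like the following may be useful. It is a sketch under stated assumptions: the function name "missing_binaries" is hypothetical, and the test only confirms that each file exists and is executable, not that it actually runs on your system.

```shell
# Hypothetical helper: print the names of the required binaries that are missing
# (or not executable) in the given directory. Returns non-zero if any are missing.
missing_binaries() {
    ext_dir=$1
    missing=0
    for bin in centroid_fold centroid_homfold lastdb lastex lastal RNAplot; do
        if [ ! -x "$ext_dir/$bin" ]; then
            echo "$bin"
            missing=1
        fi
    done
    return $missing
}

# Usage: missing_binaries "$CENTROID_HOM_LAST/external"
```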
Run the script "test_pipeline.pl" in the top directory to test the pipeline.
Compare its output with the reference log of this pipeline. If the pipeline works correctly, you should obtain the same log.
Steps in pipeline
- Search for homologous sequences in a pre-compiled database using LAST: the score threshold is determined automatically by lastex from the E-value (-e). The optimal score is computed from the E-value, the database and the input RNA sequence.
- Reduce the number of homologous sequences when there are too many: select the top-scoring sequences (-m 0) or select randomly (-m 1).
- Execute CentroidHomfold. The gamma parameter (that adjusts sensitivity and PPV of base-pairs in a predicted secondary structure) is specified by using "-g" option. Larger gamma produces more base-pairs in a predicted secondary structure.
- Remove non-canonical base-pairs in predicted secondary structures. If you would like to include non-canonical base-pairs in the prediction, please use the "--nc" option.
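The last step above can be illustrated with a short sketch. This is not the pipeline's own implementation, which is not shown in this manual. The function name "remove_noncanonical" is hypothetical, the input format (sequence on line 1, dot-bracket structure on line 2) is chosen for the example, and the sketch assumes that the canonical pairs are A-U, G-C and the G-U wobble pair.

```shell
# Hypothetical helper: read a sequence (line 1) and a dot-bracket structure
# (line 2) from stdin, and print the structure with every non-canonical
# base-pair replaced by unpaired positions (".").
remove_noncanonical() {
    awk '
    NR == 1 { seq = toupper($0); gsub(/T/, "U", seq) }
    NR == 2 {
        n = length($0); top = 0
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            out[i] = c
            if (c == "(") {
                stack[++top] = i                 # remember the opening position
            } else if (c == ")" && top > 0) {
                j = stack[top--]                 # matching opening position
                pair = substr(seq, j, 1) substr(seq, i, 1)
                if (pair !~ /^(AU|UA|GC|CG|GU|UG)$/) {
                    out[j] = "."; out[i] = "."   # open the non-canonical pair
                }
            }
        }
        s = ""
        for (i = 1; i <= n; i++) s = s out[i]
        print s
    }'
}

# Usage: printf 'GAGAAACCC\n(((...)))\n' | remove_noncanonical
```

In the usage line, the middle pair of the stem is A-C, so only that pair is opened and the two G-C pairs are kept.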
The usage of the pipeline is basically as follows:
% centroid_homfold.pl [options] <input.fna>
The "input.fna" file should be in FASTA format. In the current pipeline, "input.fna" must NOT contain more than one sequence.
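Because the pipeline accepts only one sequence per input file, it may be worth checking the input before starting a run. The sketch below counts FASTA header lines. The function name "has_single_sequence" is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper: succeed only if the FASTA file holds exactly one record,
# judged by the number of header lines (lines starting with ">").
has_single_sequence() {
    [ -r "$1" ] || return 1
    # grep -c prints 0 and exits non-zero when nothing matches; keep the count.
    n=$(grep -c '^>' "$1" || true)
    [ "$n" -eq 1 ]
}

# Usage: has_single_sequence input.fna || echo "input.fna must contain exactly one sequence" >&2
```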
The following examples show how to use the pipeline.
(1) Secondary structure prediction by using pre-compiled database
% centroid_homfold.pl -D db/Rfam.seed.99.db -S db/Rfam.seed.99.fasta -g 8 -e 0.01 -n 30 input.fna
This is a standard usage of the pipeline. See the section "Generate pre-compiled database by yourself" for how to generate a pre-compiled database for your specified RNA sequences.
(2) Secondary structure prediction by using user-specified RNA sequence (NOT using pre-compiled database)
% centroid_homfold.pl -S db/Rfam.seed.99.fasta -g 8 -e 0.01 -n 30 input.fna
If you specify only the "-S" option (without "-D"), the database is compiled within the pipeline. This mode is therefore slower than (1), which uses a pre-compiled database. If you predict secondary structures of several RNA sequences by using a large database of RNA sequences, you should employ a pre-compiled database.
(3) Secondary structure prediction by using user-specified homologous sequence (like CentroidHomfold does)
% centroid_homfold.pl -H homol.fna -g 8 -e 0.01 -n 30 input.fna
The pipeline predicts the secondary structure of input.fna by using homol.fna as homologous sequences. In this mode, LAST is not utilized in the pipeline.
Options
The main options in the pipeline are as follows:
-H <string> : Homologous sequences [null]. If you specify this option, CentroidHomfold
is directly called.
-D <string> : Prefix of LAST database (also specify -S option) [db/Rfam.seed.99]
-S <string> : Sequence file for DB [db/Rfam.seed.99.fasta]
-g <float> : Gamma parameter in CentroidHomfold; a larger gamma predicts more base-pairs
in a predicted secondary structure
-o <string> : Output file [stdout]
Users can specify the probabilistic models for secondary structures and pairwise alignments by using the following options.
--engine_a <string> : Model for alignments; ProbCons (Do et al., Genome Res. 2004),
CONTRAlign (Do et al., Genome Res. ) [CONTRAlign]
--engine_s <string> : Model for secondary structures; CONTRAfold
(Do et al. Bioinformatics, 2006), BL (Andronescu et al., RNA, 2010),
McCaskill (McCaskill, Biopolymer, 1991) [BL]
Also, users can utilize the following additional options.
-e <float> : E-value threshold for local alignment. The optimal score of alignment is
computed by using the value, the input sequence and the database [0.01]
-n <int> : Maximum number of homologous sequences
-m <int> : Method for selecting homologous sequences; 0=>score, 1=>random
--last_dir <string> : Path to bin directory for LAST (location of lastal and lastex)
--cent_dir <string> : Path to bin directory for CentroidHomfold [./external]
--param <string> : Parameter file for BL model [./param/parameters_BLstar_Vienna.txt]
--nc : Allow non-canonical base-pairs
--seed <int> : Seed for rand() when -m 1 is specified 
--tmpdir <string> : Temporary directory
--log : Show log messages
--ps : Generate a postscript (PS) file of a predicted secondary structure
Generate pre-compiled database by yourself
Because the size of the package is limited, it contains only a small database (Rfam.seed.99). This database is generated from the seed alignments in the Rfam database (Gardner et al., 2008).
If you would like to employ a database pre-compiled by yourself, follow these instructions to generate a pre-compiled database:
1) Prepare a sequence file in FASTA format: seq.fna
The header of each sequence should have the following format
where n should be an integer starting from 1. Every nucleotide sequence must be on a single line.
Example: (See also "db/Rfam.seed.99.fasta")
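The requirement that every sequence occupies a single line can be checked mechanically before building the database. The following is a sketch only. The function name "is_one_line_fasta" is hypothetical, and the check covers the one-line rule but not the header format.

```shell
# Hypothetical helper: succeed only if every header line (">...") is followed by
# exactly one sequence line, i.e. no sequence is wrapped over several lines.
is_one_line_fasta() {
    awk '
    /^[ \t]*$/ { next }                        # ignore blank lines
    /^>/ { if (seen && lines != 1) bad = 1     # previous record needs one line
           seen = 1; lines = 0; next }
         { if (!seen) bad = 1; lines++ }       # sequence line
    END  { if (!seen || lines != 1) bad = 1; exit bad }
    ' "$1"
}

# Usage: is_one_line_fasta seq.fna || echo "seq.fna has wrapped or empty sequences" >&2
```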
2) Generate LAST database
% external/lastdb DB_NAME seq.fna
where DB_NAME is an arbitrary name for your database. This name must be given with the "-D" option when executing the pipeline "centroid_homfold.pl", i.e., -D DB_NAME
3) Generate an index of RNA sequences
% scripts/mkidx.pl seq.fna > seq.fna.ary
The name of the index file should be SEQ_FILE_NAME.ary (See the above example). The index is used when the actual sequence is retrieved. The sequence file must be specified in "-S" option i.e. -S seq.fna
4) Execute the CentroidHomfold-LAST pipeline as follows:
% centroid_homfold.pl -S SEQ_FILE_NAME -D DB_NAME [other options] input.fna
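Steps 2) and 3) above can be wrapped in one small function so that the database and its index are always built together. This is a sketch, not a script shipped with the package. The function name "build_database" is hypothetical, and it assumes that CENTROID_HOM_LAST is set as described in the installation section and that "external/lastdb" and "scripts/mkidx.pl" are executable.

```shell
# Hypothetical wrapper around steps 2) and 3): build the LAST database and the
# sequence index for a FASTA file.
#   $1 = DB_NAME (database prefix), $2 = sequence file in FASTA format
build_database() {
    db_name=$1
    seq_file=$2
    if [ -z "$CENTROID_HOM_LAST" ]; then
        echo "CENTROID_HOM_LAST is not set" >&2
        return 1
    fi
    # Step 2: generate the LAST database.
    "$CENTROID_HOM_LAST/external/lastdb" "$db_name" "$seq_file" || return 1
    # Step 3: generate the index, named SEQ_FILE_NAME.ary.
    "$CENTROID_HOM_LAST/scripts/mkidx.pl" "$seq_file" > "$seq_file.ary" || return 1
}

# Usage: build_database DB_NAME seq.fna
```

After the function succeeds, the pipeline can be run as in step 4) with the same DB_NAME and sequence file.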
Performance of pipeline
In general, CentroidHomfold-LAST achieves better performance than conventional secondary structure prediction tools such as CentroidFold and RNAfold when several homologous sequences are obtained. Please see the paper cited in the "Credits & Citation" section for detailed results of the computational experiments.
Credits & Citation
If you use this pipeline in your research, please cite the following publication:
| CentroidHomfold-LAST: Accurate prediction of RNA secondary structure
| using automatically collected homologous sequences
| Michiaki Hamada, Koichiro Yamada, Kengo Sato, Martin C. Frith and Kiyoshi Asai
| submitted (2011)