SOKOS/CAN - a Sequence Oriented Kernels over SCFG with Genome Scanning Capability
A software implementation of Marginalized kernels for RNA sequence data analysis.
SOKOS/CAN (pronounced So-ko-scan) is an experimental implementation of stochastic (probabilistic) context-free grammars (SCFG) for RNA sequence analysis. It can compute the marginalized count kernel, a similarity measure between two RNA sequences that takes their potential secondary structures into account. SOKOS/CAN can be used for generic RNA sequence analysis, including secondary structure prediction and homology search. The current version of SOKOS/CAN has the following features:
- Reading multiple RNA sequences from a FASTA format file (A, C, G, T/U only, please)
- Using your own grammar designed for a particular functional class of RNA
- Training an SCFG on unaligned RNA sequences via the Expectation-Maximization algorithm
- Computing the likelihood of each RNA sequence under a given SCFG
- Predicting RNA secondary structure via the CYK algorithm with a given SCFG
- Homology search for a specific functional class of RNA in a genomic sequence through fixed-window scanning
- Computing marginalized N-gram (N=1,2,3,4) feature vectors for input sequences
- Creating a graph of the parse tree for each sequence, if you have Graphviz installed on your computer (optional)
Figure 1. A part of an SCFG parse tree created by Graphviz.
The parse tree shows how a sequence is parsed by the SCFG, which is very handy when you design your own grammar. Use the '-r' option to generate a parse tree graph for Graphviz (dot). With the '-r' option, sokos generates a file named "$(header).dot", where $(header) is the sequence I.D. captured from the input FASTA file. Use the dot command to render the graph as PostScript, PNG, JPEG or whatever format you like. If you prefer PostScript, the command looks like dot -Tps -o $(header).ps $(header).dot. For other formats, please consult the dot manual.
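If '-r' leaves you with many .dot files, the dot invocation above can be scripted. The following is only a minimal Python sketch: the helper names dot_command and render_all are hypothetical and not part of SOKOS/CAN, and it assumes the Graphviz dot executable is on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def dot_command(dot_file, fmt="ps"):
    """Build the argument list for: dot -T<fmt> -o <name>.<fmt> <name>.dot"""
    src = Path(dot_file)
    out = src.with_suffix("." + fmt)  # replaces the trailing ".dot"
    return ["dot", "-T" + fmt, "-o", str(out), str(src)]


def render_all(directory=".", fmt="ps"):
    """Render every .dot file found in `directory`; return the output paths."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' was not found on PATH")
    outputs = []
    for src in sorted(Path(directory).glob("*.dot")):
        cmd = dot_command(src, fmt)
        subprocess.run(cmd, check=True)  # raises if dot reports an error
        outputs.append(cmd[3])
    return outputs
```

Calling render_all(".", "png") would then produce one PNG per parse tree in the current directory.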
DOWNLOAD
SOKOS/CAN is licensed under the GNU General Public License, so you can freely use, modify and redistribute the software. No warranty or support is provided for the use of this software, and we are not responsible for any damage resulting from its use. When you redistribute the software with your own modifications, you are required to provide the source code.
- Latest development snapshot: sokos-20070522.tar.gz
BUILD
"%{version}" refers to an appropriate version number.
> tar -zxf sokos-%{version}.tar.gz
> cd sokos-%{version}
> cd libsvm-2.4
> (edit Makefile if you need)
> make clean all
> cd ..
> make clean all
Move or copy the executables "sokos", "kernelscan" and "sokoscan" to wherever you want them to reside.
USAGE
There are several executables generated through the build process:
- sokos --- SCFG training, secondary structure prediction, feature vector generation
- sokoscan --- homology search over a genomic sequence with trained SCFG, Support Vector Machine and fixed-width scanning window
- kernelscan --- a derivative of sokoscan without SVM, which emits the mean kernel value for each scanning frame
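The fixed-width scanning used by sokoscan and kernelscan can be pictured as sliding a window along the genomic sequence and scoring each frame. The sketch below only illustrates the window enumeration; the function name scan_windows and the width and step parameters are hypothetical, and the actual window size, step and scoring used by the executables are not described here.

```python
def scan_windows(genome, width, step=1):
    """Yield (start, window) for every full-width window over `genome`.

    `start` is a 0-based offset. Windows that would run past the end of
    the sequence are not produced.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    for start in range(0, len(genome) - width + 1, step):
        yield start, genome[start:start + width]


if __name__ == "__main__":
    # Each window would then be scored, e.g. by an SVM or a mean kernel value.
    for start, window in scan_windows("ACGUACGU", 4, 2):
        print(start, window)
```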
Sokos shows its usage overview when you run sokos without any argument or options.
> sokos
USAGE
sokos [OPTIONS] <sequence file> <scfg>
e.g. sokos seqs.fna test.scf
sokos [OPTIONS] <literal sequence> <scfg>
e.g. sokos ACCGGGCGUGCC test.scf
Provides fundamental services of SCFG with Marginalized Kernel computaion.
Sokos accepts a FASTA formatted file and a literal sequence string like
`ACCGGGCGUGCC' for sequence data input (training/test).
INPUT
<sequence file> a FASTA formatted sequence file (allows multiple sequences)
sokos only allows A, U/T, C and G in the sequence.
<literal sequence> a sequence string (does not allow multiple sequences)
sokos only allows A, U/T, C and G in the sequence.
<scfg> provides SCFG transition probabilities and emission probabilities.
Please refer to SCFG.txt for details.
OPTIONS
-A train SCFG with annotated sequence
-c skip SCFG learning
-l devide log likelihood with the sequence length
-n DO NOT normalize feature vectors and/or kernel matrix
-o <file> output trained SCFG parameters to <file>
-p <file> output likelihood of each sequence to <file>
-t <file> estimate and output secondary structure
with max. likelihood for each sequence to <file>
-d <float> repeat EM learning until likelihood is <float> times
greater than the previous one
'-c' cancels this option. default 1.05
-T <int> repeat EM learning <int> times at least
'-c' cancels this option. default 40
-B <int> use kernel function specified by <int>
0=linear, 1=polynomial, 2=RBF, 3=sigmoid
default 0.
-b <str> define parameters of a kernel function
example: -b a=1.0,b=0.89,...
-f <file> compute and output 1st-order feature vectors
of each sequence to <file>
-k <file> compute and output 1st-order marginalized
kernel matrix to <file>
-F <file> compute and output 2nd-order feature vectors
of each sequence to <file>
-K <file> compute and output 2nd-order marginalized
kernel matrix to <file>
-m <int> use <int> threads. will give better performance on SMP machines.
default 1.
-v prints version number
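Because sokos only allows A, U/T, C and G in its input, it can be useful to check a FASTA file before running it. The following is a minimal sketch of such a check, not part of SOKOS/CAN: read_fasta and invalid_symbols are hypothetical helper names, and treating lower-case letters as valid is an assumption, since the usage text does not say how sokos handles case.

```python
ALLOWED = set("ACGUT")


def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_symbols(sequence):
    """Return the sorted characters that are not A, C, G, U or T.

    Assumption: the check is case-insensitive.
    """
    return sorted(set(sequence.upper()) - ALLOWED)


if __name__ == "__main__":
    with open("seqs.fna") as handle:  # file name taken from the usage example
        for header, seq in read_fasta(handle):
            bad = invalid_symbols(seq)
            if bad:
                print(header, "contains unsupported symbols:", "".join(bad))
```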
Examples
Learning
To train the SCFG defined in the file initial.scfg on the training sequence data training_seq.fna and write the trained SCFG to trained.scfg, one can run sokos as shown below.
In the output of this example, "Learn= <number>" is the number of learning iterations completed so far. The line beginning with "Learn= 0" shows the averaged value of the initial likelihoods of the training sequence data with respect to the initial SCFG. "L=..." stands for a log likelihood, while "P=..." is a plain likelihood. "d=..." shows by what factor the likelihood improved over the previous iteration. By default, sokos stops learning when the d value drops below 1.05. Please refer to the -d and -T options for more details on controlling learning.
> sokos -o trained.scfg training_seq.fna initial.scfg
Learn= 0 L=-3.511356e+01, P=5.628245e-16
Learn= 1 L=-3.083211e+01, P=4.071767e-14, d=72.345239
Learn= 2 L=-2.949007e+01, P=1.558207e-13, d=3.826856
Learn= 3 L=-2.913738e+01, P=2.217169e-13, d=1.422898
Learn= 4 L=-2.901826e+01, P=2.497644e-13, d=1.126502
Learn= 5 L=-2.898123e+01, P=2.591863e-13, d=1.037723
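To follow convergence programmatically, the learning log can be parsed line by line. The sketch below assumes the log looks exactly like the sample output above; parse_learning_log is a hypothetical helper name. In the sample, each d value agrees with the ratio of consecutive P values, which the usage code checks.

```python
import re

# Pattern based on the sample log lines; "d=" is absent on the first line.
LOG_LINE = re.compile(
    r"Learn=\s*(\d+)\s+L=(\S+?),\s*P=(\S+?)(?:,\s*d=(\S+))?\s*$"
)

SAMPLE = """\
Learn= 0 L=-3.511356e+01, P=5.628245e-16
Learn= 1 L=-3.083211e+01, P=4.071767e-14, d=72.345239
Learn= 2 L=-2.949007e+01, P=1.558207e-13, d=3.826856
Learn= 3 L=-2.913738e+01, P=2.217169e-13, d=1.422898
Learn= 4 L=-2.901826e+01, P=2.497644e-13, d=1.126502
Learn= 5 L=-2.898123e+01, P=2.591863e-13, d=1.037723
"""


def parse_learning_log(text):
    """Return one dict per "Learn=" line with keys step, L, P and d."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            step, log_l, p, d = match.groups()
            rows.append({
                "step": int(step),
                "L": float(log_l),
                "P": float(p),
                "d": float(d) if d is not None else None,
            })
    return rows


if __name__ == "__main__":
    rows = parse_learning_log(SAMPLE)
    for prev, cur in zip(rows, rows[1:]):
        # Compare the reported d with the ratio of plain likelihoods.
        print(cur["step"], cur["d"], cur["P"] / prev["P"])
```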
Testing
To test the sequence data test_seq.fna against the SCFG trained.scfg and write the results to the file test_seq.lik, one can run sokos as shown below. test_seq.lik contains the FASTA header of each test sequence and its likelihood in log and non-log forms.
> sokos -c -p test_seq.lik test_seq.fna trained.scfg
> cat test_seq.lik
>TERM0620,ECOMSBBA,M77039,Escherichia coli,895..925,'rho-independent'
log_prob= -4.564901e+01 prob= 1.495851e-20
>TERM0628,ECOGNDG,M64324,Escherichia coli,1568..1599,factor independent terminator
log_prob= -4.813710e+01 prob= 1.242577e-21
>TERM0640,ECOLPLA,L27665,Escherichia coli,1099..1134,Rho-independent; putative
log_prob= -6.308309e+01 prob= 4.011992e-28
>TERM0642,ECOTEHAB,U12598,Escherichia coli,1754..1795,rho-independent trascriptional terminator
log_prob= -6.754773e+01 prob= 4.617381e-30
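The likelihood file is easy to post-process, for example to rank sequences by log likelihood. This is a minimal sketch that assumes the two-line record layout shown in the sample output above (a ">" header line followed by a "log_prob= ... prob= ..." line); parse_likelihoods is a hypothetical helper name.

```python
import math


def parse_likelihoods(text):
    """Return a list of (header, log_prob, prob) tuples from a .lik file."""
    records = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
        elif line.startswith("log_prob=") and header is not None:
            # Expected fields: ["log_prob=", <value>, "prob=", <value>]
            fields = line.split()
            records.append((header, float(fields[1]), float(fields[3])))
            header = None
    return records


if __name__ == "__main__":
    with open("test_seq.lik") as handle:
        records = parse_likelihoods(handle.read())
    # Highest log likelihood first.
    for header, log_prob, prob in sorted(records, key=lambda r: r[1], reverse=True):
        print(f"{log_prob:.4f}\t{header}")
```

In the sample output, prob equals exp(log_prob) up to rounding, which makes a convenient sanity check after parsing.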
Secondary Structure Prediction
To predict the secondary structures of the sequences in seq.fna with the SCFG trained.scfg and write them to the file seq.str, one can run sokos as shown below. seq.str contains the FASTA header of each sequence, the sequence itself, the most probable secondary structure, and state labels.
> sokos -c -t seq.str seq.fna trained.scfg
> cat seq.str
>TERM0620,ECOMSBBA,M77039,Escherichia coli,895..925,'rho-independent'
AGCCGGUACGCAGUCAGUACCGGCUUUUUUU
(((((((((.......))))))))......)
>TERM0628,ECOGNDG,M64324,Escherichia coli,1568..1599,factor independent terminator
CAGGCCCGGAGUGCUCCUCCGGGCUUUUAAUU
...((((((((.....))))))))........
>TERM0640,ECOLPLA,L27665,Escherichia coli,1099..1134,Rho-independent; putative
GUUACCCGCCCAUGCGGGCAACUUUCUCUUCGAUUU
(((.(((((....))))).)))..............
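The predicted structures are written in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. The sketch below converts such a string into a list of base-pair positions. It is only an illustration: dot_bracket_pairs is a hypothetical helper name, and it handles the structure line only, not the whole seq.str file.

```python
def dot_bracket_pairs(structure):
    """Return sorted (i, j) pairs of 0-based positions, with i < j.

    "(" opens a pair, ")" closes the most recently opened one and "."
    is an unpaired position. Any other symbol raises ValueError.
    """
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    seq = "GUUACCCGCCCAUGCGGGCAACUUUCUCUUCGAUUU"
    structure = "(((.(((((....))))).))).............."
    for i, j in dot_bracket_pairs(structure):
        print(i, j, seq[i] + "-" + seq[j])
```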
Kernel Computation
To compute marginalized kernels for the sequence data seq.fna with the SCFG trained.scfg, one can run sokos as shown below, where seq.mk1 stores the 1st-order marginalized count kernels and seq.mk2 stores the 2nd-order marginalized count kernels. At the moment, SOKOS computes linear kernels only.
> sokos -c -k seq.mk1 -K seq.mk2 seq.fna trained.scfg
If you want to compute something other than a linear kernel, the feature vector options will be helpful. In the command below, seq.fv1 stores the 1st-order feature vectors and seq.fv2 stores the 2nd-order feature vectors. You can compute any kernel you like from these feature vectors.
> sokos -c -f seq.fv1 -F seq.fv2 seq.fna trained.scfg
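As an illustration of building other kernels from feature vectors, the sketch below computes a normalized linear kernel and an RBF kernel over plain Python lists. The layout of the seq.fv1 and seq.fv2 files is not described here, so loading them is left out; the function names and the gamma parameter are hypothetical, and the formulas are the standard textbook definitions rather than anything taken from the SOKOS/CAN source.

```python
import math


def linear_kernel(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def normalized_linear_kernel(x, y):
    """Linear kernel divided by the norms of x and y (cosine similarity)."""
    denom = math.sqrt(linear_kernel(x, x) * linear_kernel(y, y))
    return linear_kernel(x, y) / denom if denom > 0 else 0.0


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma is a free parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(vectors, kernel):
    """Evaluate `kernel` on every pair of vectors; returns a list of rows."""
    return [[kernel(x, y) for y in vectors] for x in vectors]


if __name__ == "__main__":
    # Toy vectors standing in for feature vectors read from seq.fv1.
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in kernel_matrix(vectors, rbf_kernel):
        print(" ".join(f"{value:.4f}" for value in row))
```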
REFERENCES
T. Kin, K. Tsuda and K. Asai. (2002) Marginalized Kernels for RNA Sequence Data Analysis, Proc. Genome Informatics Workshop, 2002. [PubMed]
K. Tsuda, T. Kin and K. Asai. (2002) Marginalized kernels for biological sequences. Bioinformatics 18:S268-S275. [PubMed]
Copyright Notice: All Rights Reserved By Taishin Kin 2002, CBRC/AIST

