Question

Too many sequences of EVM output

0

Entering edit mode

8.4 years ago

Ginsea Chen ▴ 130

Dear all.

I used EVM to intergate all evidences (de novo, RNA-seq and protein evidences) into a single gff3 file (evm.out.gff3), and then found that too many sequences were contained in gff3 file. So my question is how to filter these sequences and then produce available gene models like lots of genome sequencing project.

Thanks all!

genome EVM • 3.8k views

ADD COMMENT • link updated 6.2 years ago by Philipp Bayer 8.3k • written 8.4 years ago by Ginsea Chen ▴ 130

score 1 · Answer 1 · 2018-01-29

I know this is older but now I had the same problem, this is how I solved some of it.

After this command:

$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out

You'll have one folder for each scaffold/pseudomolecule. In there is a file called evm.out which looks perhaps like this:

# EVM prediction: Mode:STANDARD S-ratio: 3.86 1711-2833 orient(+) score(2726.00) noncoding_equivalent(706.17) raw_noncoding(959.00) offset(252.83)
1711    1811    initial+        1       2       {Augustus_C1-g1.t1;Augustus},{GeneMark.hmm_9038_t;GeneMark.hmm}
1812    1936    INTRON                  {Augustus_C1-g1.t1;Augustus}
1937    2288    internal+       3       3       {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus}
2289    2365    INTRON                  {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus}
2366    2833    terminal+       1       3       {Augustus_C1-g1.t1;Augustus},{Augustus_C1-g1.t2;Augustus}

# EVM prediction: Mode:STANDARD S-ratio: 6.24 2875-4096 orient(+) score(5809.00) noncoding_equivalent(930.83) raw_noncoding(1741.00) offset(810.17)
2875    3002    initial+        1       2       {Augustus_C1-g2.t1;Augustus}
3003    3079    INTRON                  {Augustus_C1-g2.t1;Augustus}
3080    3267    internal+       3       1       {Augustus_C1-g2.t1;Augustus}
3268    3880    INTRON                  {Augustus_C1-g2.t1;Augustus}
3881    3920    internal+       2       2       {Augustus_C1-g2.t1;Augustus}
3921    4089    INTRON                  {Augustus_C1-g2.t1;Augustus},{ev_type:Cufflinks/ID=MSTRG.2.1/Target=cuff-MSTRG.2.1;Cufflinks},{ev_type:Cufflinks/ID=MSTRG.2.5/Target=cuff-MSTRG.2.5;Cufflinks},{ev_type:Cufflinks/ID=MSTRG.2.8/Target=cuff-MSTRG.2.8;Cufflinks}
4090    4096    terminal+       3       3       {Augustus_C1-g2.t1;Augustus}

We have two genes here, only one of them is partially supported by RNASeq. You 'could' filter the evm.out by the score, but I've noticed that genes with many exons and no RNASeq support have large scores, too. I wrote a small Python script which iterates over all evm.out files and removes predictions that have NO RNAseq in them, in my case, a line containing 'Cufflinks' because that's what I called my RNASeq evidence in the weights.txt and StringTie output gff3:

I'm running this script in the root directory of my EvidenceModeler output (the folder that has one folder for each scaffold/pseudomolecule). This makes a ton of new files ending in nonRNARemoved.out.

Then I checked whether I was happy with those files (for example, are they all empty?), then made backups of my original evm.out files:

find . -maxdepth 2 -name evm.out -exec mv {} {}.bak \;

renamed my filtered files to evm.out:

for l in $(find . -maxdepth 2 -name evm_nonRNARemoved.out); do mv $l ${l%%_nonRNARemoved.out}.out; done

and ran the gff3 conversion step from the EvidenceModeler manual:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out  --genome genome.fasta

and concatenated the gff3 files:

find . -regex ".*evm.out.gff3" -exec cat {} \; > EVM.all_only_with_RNASeq_evidence.gff3

The output gff3 is weird and still contains the 'filtered' genes - in the second column, filtered genes have a '.', unfiltered genes have an 'EVM'.

In my case, this almost halved the gene predictions. This is kiiiiind of like filtering for AED < 1 in MAKER, but only kind of.

You could make this fancier - in my case, only one of the introns has to be supported by RNASeq, and even in that case it doesn't matter how well supported the intron is. You could be more stringent and ask to have all introns supported by RNASeq which will filter stuff even more. I leave that as an exercise to the reader :)

Other things to try:

actually filter by score. In that case, I'd start by plotting the scores in a histogram. What are noncoding_equivalent etc.? I can't find those values described anywhere. You wouldn't need to change much in the above script, don't need to check for Cufflinks but parse out the score from the comment line
filter the final gff3 by other ways, for example, get rid of genes without start or stop codons. You'd have to run gffread to get the CDS/AA I guess?

score 0 · Answer 2 · 2017-02-17

0

Entering edit mode

7.2 years ago

huxiaorongfmoet • 0

hello ,my friends,can i talk with you about the evm?

ADD COMMENT • link 7.2 years ago by huxiaorongfmoet • 0

0

Entering edit mode

这是我的邮箱632662110@qq.com

ADD REPLY • link 7.2 years ago by huxiaorongfmoet • 0