Question

getting a combination of all predicted genes

0

Entering edit mode

5.9 years ago

bitpir ▴ 240

Hello, I was wondering if anyone knows the best way to doing this. I have results from 2 ORF predictors, Mga and Genemark. I now want to combine all the predicted ORFs together. I would like to keep all overlapping ORFs and also the genes that are slightly longer (e.g. +10bp). The MGA result is as follow:

NC_003552.1 mga cds 1   1500    133.183 -   0   gene_1; 11  
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

The Genemark result is this:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 gms cds 1501    1780    114.856 +   0   gene_2; 11

My desired result is this:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

I have tried to use bedtools intersect but it didn't get me the right answer. Other better tools to doing this will be welcomed too! Thanks!

bedtools gff • 1.3k views

ADD COMMENT • link updated 5.9 years ago by Alex Reynolds 35k • written 5.9 years ago by bitpir ▴ 240

0

Entering edit mode

you bluntly want to merge the 2 predictions together? I'm not convinced this is the way forward as this does not guarantee the 'merged' models will make any biological sense.

But I'm guessing you might know that already. So what is the goal of all this?

ADD REPLY • link 5.9 years ago by lieven.sterck 15k

0

Entering edit mode

@Alex may provide a specific answer for this but you can take a look at his older answer here: How To Get Annotation For Bed File From Another Bed File

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

I do not know anything about these predictors and their output format. But I can certainly say BEDtools will not work on them (most of the BEDtools functions require BED regions to work).

From what I understand from your example, you just need union of all results?

cat MGA_results.txt GENEMARK_result.txt | sort -u > all_results.txt

Is this something you are looking for?

ADD REPLY • link 5.9 years ago by venu 7.1k

0

Entering edit mode

Thanks for the response! I actually made a mistake in the desired result, it's corrected now. I aim to get all of the orfs predicted by mga and genemark. However, when the same orf is predicted in the same region, I'll give preference to the longer one:

NC_003552.1 mga cds 1   1500    133.183 -   0   gene_1; 11  
NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 gms cds 1501    1780    114.856 +   0   gene_2; 11

In this case, I'll take the second and the third ones.

If there are overlapping ORFs or orfs in the other strand (-), then I'll keep both:

NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

This will result in the final result:

NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11
NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  
NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11

ADD REPLY • link 5.9 years ago by bitpir ▴ 240

0

Entering edit mode

Try this and let me know if it scales up (remove last two columns):

$ cat gms.txt mga.txt | datamash  -s -g 1,3,4,7  --full min 4 max 5
    NC_003552.1 gms cds 1   1503    133.183 -   0   gene_1; 11  1   1503
    NC_003552.1 mga cds 1501    2000    114.856 +   0   gene_2; 11  1501    2000
    NC_003552.1 mga cds 1750    2000    81.7025 -   0   gene_3; 11  1750    2000

ADD REPLY • link 5.9 years ago by cpad0112 21k

score 0 · Answer 1 · 2018-05-22

0

Entering edit mode

5.9 years ago

Alex Reynolds 35k

Union with bedops -u and perhaps filter the unioned set by overlap with bedmap --range or bedmap --echo-overlap-size operations, piping to awk to print qualifying records. I have to admit that the logic is a bit unclear to me but that's perhaps where I might start.

ADD COMMENT • link 5.9 years ago by Alex Reynolds 35k