Genes flanking RIP loci
1
0
Entering edit mode
7.1 years ago
spaul8505 ▴ 20

Hi,

How is it possible to find the genes that are located within a 2 kb region flanking the RIP loci for a species. I have the RIP results from RIPCAL analysis.

My data looks like this

genes.bed

 scaffold100|size105690  3181    5110    genemark-scaffold100|size105690-processed-gene-0.34     .       +       maker   gene    
 scaffold100|size105690  5243    11348   maker-scaffold100|size105690-augustus-gene-0.109        .       +     maker   gene     
 scaffold100|size105690  11606   12232   maker-scaffold100|size105690-augustus-gene-0.110        .       +       maker   gene   
scaffold100|size105690  12428   13688   augustus_masked-scaffold100|size105690-processed-gene-0.18      .       -       maker   gene

RIP loci.bed

scaffold1000|size3289   0       3276    .       1.76688149833334        +       RIP     region  .       id="RIP_1077;RIP_length=3276;RIP_max=2.32797202797203;
scaffold1001|size3283   0       3276    .       1.91911902719   +       RIP     region  .       id="RIP_2272;RIP_length=3276;RIP_max=2.8247619047619;
scaffold1002|size3281   0       3276    .       1.87247756983344        +       RIP     region  .       id="RIP_310;RIP_length=3276;RIP_max=2.28728728728729;
scaffold1003|size5038   0       5051    .       2.01247550534218        +       RIP     region  .       id="RIP_1794;RIP_length=5051;RIP_max=3.13636363636364;

Thanks

RIPCAL mutations flanking BEDtools • 1.5k views
ADD COMMENT
1
Entering edit mode
7.1 years ago

Depending on what inputs you have:

$ gtf2bed < genes.gtf > genes.bed
$ bedmap --echo --echo-map-id-uniq --delim '\t' --range 2000 rip_loci.bed genes.bed > answer.bed

Or combine the conversion and mapping steps for bonus performance:

$ gtf2bed < genes.gtf | bedmap --echo --echo-map-id-uniq --delim '\t' --range 2000 rip_loci.bed - > answer.bed

Or in bash:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --range 2000 rip_loci.bed <(gtf2bed < genes.gtf) > answer.bed
ADD COMMENT
0
Entering edit mode

Hello Alex,

I have edited my question to include a subset of the data for an example, I tried using bedmap to get the flanking regions within 2Kb of the RIP loci (see Rip loci.bed) but it did not work, I just get the entire list of RIP loci instead of the result.

ADD REPLY
0
Entering edit mode

You need to have chromosome names that match, by fixing your gene inputs:

$ awk 'BEGIN{OFS="\t";}{print "scaffoldN",$2,$3,$4,$5,$6,$7,$8;}' genes.bed | sort-bed - > genes.fixed.bed

And then again for the loci:

$ awk 'BEGIN{OFS="\t";}{print "scaffoldN",$2,$3,$4,$5,$6,$7,$8,$9,$10;}' loci.bed | sort-bed - > loci.fixed.bed

Then you can run bedmap:

$ bedmap --ec --echo --echo-map --delim '%%' --range 2000 loci.fixed.bed genes.fixed.bed 
scaffoldN   0   3276    .   1.76688149833334    +   RIP region  .   id="RIP_1077;RIP_length=3276;RIP_max=2.32797202797203;%%scaffoldN   3181    5110    genemark-scaffold100|size105690-processed-gene-0.34 .   +   maker   gene;scaffoldN  5243    11348   maker-scaffold100|size105690-augustus-gene-0.109    .   +   maker   gene
scaffoldN   0   3276    .   1.87247756983344    +   RIP region  .   id="RIP_310;RIP_length=3276;RIP_max=2.28728728728729;%%scaffoldN    3181    5110    genemark-scaffold100|size105690-processed-gene-0.34 .   +   maker   gene;scaffoldN  5243    11348   maker-scaffold100|size105690-augustus-gene-0.109    .   +   maker   gene
scaffoldN   0   3276    .   1.91911902719   +   RIP region  .   id="RIP_2272;RIP_length=3276;RIP_max=2.8247619047619;%%scaffoldN    3181    5110    genemark-scaffold100|size105690-processed-gene-0.34 .   +   maker   gene;scaffoldN  5243    11348   maker-scaffold100|size105690-augustus-gene-0.109    .   +   maker   gene
scaffoldN   0   5051    .   2.01247550534218    +   RIP region  .   id="RIP_1794;RIP_length=5051;RIP_max=3.13636363636364;%%scaffoldN   3181    5110    genemark-scaffold100|size105690-processed-gene-0.34 .   +   maker   gene;scaffoldN  5243    11348   maker-scaffold100|size105690-augustus-gene-0.109    .   +   maker   gene

How you fix up your scaffold name scheme is up to you, but the idea is that the two inputs need to have matching chromosome names, in order for any mapping of overlapping elements to work.

Note that I am replacing the delimiter between reference element and any mapped elements with %%, because your mapped elements contain in them the same delimiter (| in scaffold names) that bedmap uses as its default.

ADD REPLY
0
Entering edit mode

Thanks. I was not aware that there should be matching names.

ADD REPLY
0
Entering edit mode

Yes, these are BED files and so the first column should be a unique identifier for a chromosome, scaffold, or other collection of intervals you want to test for set overlap. If you have two or more groups, you would name them uniquely, as needed.

ADD REPLY

Login before adding your answer.

Traffic: 2224 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6