Question

Identify reads overlapping a junction in gff file

0

Entering edit mode

9.3 years ago

jkhurana2311 • 0

Dear all,

I have a gff annotation file with coordinates for an overlapping genomic feature, e.g.

chr1 1-100
chr1 100-250
chr1 250-300
chr1 300-500

and an alignment file (bam/bed) with reads mapping to that chromosome. I need to extract reads that are specifically overlapping the junctions of that genomic feature. So, in a bed file like this,

chr1 50-70
chr1 90-110
chr1 150-170
chr1 245-255

Only extract

chr1 90-110
chr1 245-255

Any existing tools that can help me do that. I know bedtools intersect will give me all the common overlap but I am only looking for reads mapping the junctions of concurrent genomic features. I am not well versed in bioinformatics so I can't write my own scripts but am able to run existing tools.

Any ideas would be appreciated.

Thanks

junction gff overlap • 2.9k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by jkhurana2311 • 0

0

Entering edit mode

Search this site/the web for bedtools intersect

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks for the idea, Istvan Albert. That actually makes a lot of sense. I can do that.

Thanks again

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by jkhurana2311 • 0

Ram · Answer 1 · 2014-12-21

One way to do this is to make a pipeline that:

Converts the annotations to BED via convert2bed -i gff
Uses awk to get one base elements, if the previous stop and current start position values are equal
Uses BEDOPS bedops --range to expand (for example) a 10-base window around the resulting one-base elements (adjusting this window to whatever you need)
Maps the ranged elements to find overlapping reads via bedmap and convert2bed -i bam

$ convert2bed -i gff < annotations.gff \
    | awk ' \
        BEGIN { \
            previous_chr = "chr0"; \
            previous_stop = -1; \
        } \
        { \
            current_chr = $1; \
            current_start = $2; \
            current_stop = $3; \
            if ((current_start == previous_stop) && (current_chr == previous_chr)) { \
                print current_chr"\t"current_start"\t"(current_start + 1); \
            } \
           previous_chr = current_chr; \
           previous_stop = current_stop; \
        } \
    }' \
    | bedops --everything --range 10 - \
    | bedmap --echo --echo-map --skip-unmapped - <(convert2bed -i bam < reads.bam) \
    > answer.bed

Ram · Answer 2 · 2014-12-20

Often the way to solve these problems lies in building one ore more new dataset that when used with bedtools intersect produces what you need.

For example here create file that contains a marker in the form of a 1bp long interval for its start of each feature. Then in another file create markers for the 1bp interval of the end of each feature. Now intersect the two files and you'll end up with an file with intervals that are common. Use this file to intersect with the reads.