Question: Overlap between ranges
3 months ago
waqaskhokhar999 • 40

I have two tab-separated files. File_1 contains exon information (output by StringTie) and is structured like this:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd
1   1   +   3631    3913    46  46  46  10.371  10.5056 10.371  10.5056
2   1   +   3996    4276    83  83  83  22.3559 4.7919  22.3559 4.7919
3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786

File_2 contains splice junction information (output by STAR) and is structured like this:

chr start   end strand 
1   3914    3995    1
1   4277    4485    1   
1   4496    4505    1
1   4716    5075    1

* strand (0: undefined, 1: +, 2: -)

I am looking for a script that first checks the chromosome and then extracts the rows where the start and end coordinates of file_2 ($2 and $3) lie within the start and end coordinates of file_1 ($4 and $5). The expected output is each matching row from file_1 followed by the corresponding row from file_2. For example, the start and end coordinates in the third and fourth rows of file_2 lie within the third and fourth rows of file_1, so the expected output would be:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd chr start   end strand

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294    1 4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786    1 4716    5075    1

Thanks in advance

3 months ago
SMK ♦ 1.3k
Ghent, Belgium

One idea is to make use of bedtools intersect (assuming the files are all tab separated):

# Pretend that they're in bed format
cut -f2,4,5 file_1 > file_1.bed
cut -f-3 file_2 > file_2.bed

# Use the starts and ends returned by bedtools intersect to query the original rows
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed -wa | cut -f2,3) file_1 > file_1.intersect
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed) file_2 > file_2.intersect

# Combine them
paste file_1.intersect file_2.intersect

Which will return:

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294  1   4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786  1   4716    5075    1

Note that the first grep matches on start and end only, so you should check whether the same start/end pair occurs on more than one chromosome in file_1, e.g. with cut -f4,5 file_1 | sort | uniq -c.
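
If strict containment matters (by default bedtools intersect reports any overlap, so a junction that only partially overlaps an exon would also pass), an alternative is to do the comparison directly in awk. The following is only a rough sketch, not a tested pipeline: the function name contained_junctions is made up for illustration, it assumes both files are tab separated with exactly one header line, and it ignores strand. The demo files hold the rows from the question, with file_1 truncated to its first five columns for brevity.

```shell
# Print each file_1 (exon) row followed by every file_2 (junction) row that
# lies fully inside it on the same chromosome.
# Usage: contained_junctions <file_1> <file_2>   (hypothetical helper name)
contained_junctions() {
    awk -F'\t' -v OFS='\t' '
        # First input (file_2): skip the header, store junctions per chromosome.
        NR == FNR {
            if (FNR > 1 && $1 != "") {
                n[$1]++
                jstart[$1, n[$1]] = $2 + 0
                jend[$1, n[$1]]   = $3 + 0
                jrow[$1, n[$1]]   = $0
            }
            next
        }
        # Second input (file_1): skip the header.
        FNR == 1 { next }
        # For each exon, test every junction stored for the same chromosome.
        {
            for (i = 1; i <= n[$2]; i++)
                if (jstart[$2, i] >= $4 + 0 && jend[$2, i] <= $5 + 0)
                    print $0, jrow[$2, i]
        }
    ' "$2" "$1"
}

# Demo on the rows from the question.
demo_dir=$(mktemp -d)
{
    printf 'e_id\tchr\tstrand\tstart\tend\n'
    printf '1\t1\t+\t3631\t3913\n'
    printf '2\t1\t+\t3996\t4276\n'
    printf '3\t1\t+\t4486\t4605\n'
    printf '4\t1\t+\t4706\t5095\n'
} > "$demo_dir/file_1"
{
    printf 'chr\tstart\tend\tstrand\n'
    printf '1\t3914\t3995\t1\n'
    printf '1\t4277\t4485\t1\n'
    printf '1\t4496\t4505\t1\n'
    printf '1\t4716\t5075\t1\n'
} > "$demo_dir/file_2"

contained_junctions "$demo_dir/file_1" "$demo_dir/file_2"
```

Because the chromosome is part of the lookup key, identical start/end pairs on different chromosomes are not confused, and an exon that contains several junctions gets one output row per junction. All junctions are held in memory, which should be fine for typical STAR output but is worth keeping in mind for very large files.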
