Overlap between ranges
4.9 years ago

I have two tab-separated files. File_1 contains exon information (output by StringTie) and is structured like this:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd
1   1   +   3631    3913    46  46  46  10.371  10.5056 10.371  10.5056
2   1   +   3996    4276    83  83  83  22.3559 4.7919  22.3559 4.7919
3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786

File_2 contains splice junction information (output by STAR) and is structured like this:

chr start   end strand 
1   3914    3995    1
1   4277    4485    1   
1   4496    4505    1
1   4716    5075    1

* strand (0: undefined, 1: +, 2: -)

I am interested in a script that first checks the chromosome and then extracts the lines in which the start and end coordinates of file_2 ($2 and $3) lie within the start and end coordinates of file_1 ($4 and $5), so that the expected output is each overlapping row from file_1 followed by the matching row from file_2. For example, the start and end coordinates in the third and fourth rows of file_2 lie within those of the third and fourth rows of file_1, so the expected output will be:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd chr start   end strand
3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294  1   4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786  1   4716    5075    1

Thanks in advance

RNA-Seq • 1.1k views
4.9 years ago
AK ★ 2.2k

One idea is to make use of bedtools intersect (assuming the files are all tab separated):

# Pretend that they're in bed format
cut -f2,4,5 file_1 > file_1.bed
cut -f-3 file_2 > file_2.bed

# Use the starts and ends returned by bedtools intersect to query the original rows
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed -wa | cut -f2,3) file_1 > file_1.intersect
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed) file_2 > file_2.intersect

# Combine them
paste file_1.intersect file_2.intersect

Which will return:

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294  1   4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786  1   4716    5075    1

But you might have to check whether the same start and end coordinates occur on different chromosomes in file_1, since the grep step matches on coordinates only. You can count them with cut -f4,5 file_1 | sort | uniq -c.
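As a minimal sketch of that check, the helper below prints only the start/end pairs that appear on more than one row of file_1. The function name duplicated_coords is made up for illustration, and it assumes file_1 is tab separated with a single header line and start/end in columns 4 and 5, as in the question.

```bash
# Hypothetical helper (name is an assumption, not part of any tool):
# list start/end pairs that occur on more than one row of file_1.
duplicated_coords() {
    # $1 = file_1 (tab separated, one header line, start/end in columns 4 and 5)
    tail -n +2 "$1" |   # drop the header line
        cut -f4,5 |     # keep only start and end
        sort |          # group identical pairs together
        uniq -d         # print each pair that appears more than once
}
```

Running duplicated_coords file_1 and getting no output suggests that the coordinate-based grep step is unambiguous for that file. Any pair it does print would be worth checking by hand.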

Many thanks for your efforts. Is it possible to place the matching rows from file_2 directly alongside the matching rows of file_1? Currently it is pasting all rows from file_1 next to only the matched rows from file_2.

Try swapping the files when you paste.
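If paste still misaligns the rows, one alternative is to build each pair on a single line with awk, so that a file_2 row is only ever printed next to the file_1 row that contains it. This is a rough sketch, not the approach from the answer above: the function name contained_pairs is made up, and it assumes both files are tab separated with one header line each, that file_2 has at least one line, and that "within" means junction start >= exon start and junction end <= exon end on the same chromosome.

```bash
# Hypothetical helper (name is an assumption): print every file_1 row followed
# by each file_2 row on the same chromosome whose start..end lies inside it.
contained_pairs() {
    # $1 = file_1 (exons), $2 = file_2 (junctions); both tab separated with a header
    awk -F'\t' -v OFS='\t' '
        # file_2 is read first: store its junctions per chromosome.
        NR == FNR {
            if (FNR > 1) {                      # skip the file_2 header
                n[$1]++
                jstart[$1, n[$1]] = $2 + 0      # + 0 forces a numeric value
                jend[$1, n[$1]]   = $3 + 0
                jrow[$1, n[$1]]   = $0
            }
            next
        }
        FNR == 1 { next }                       # skip the file_1 header
        {
            # $2 = chr, $4 = exon start, $5 = exon end in file_1
            for (i = 1; i <= n[$2]; i++)
                if (jstart[$2, i] >= $4 + 0 && jend[$2, i] <= $5 + 0)
                    print $0, jrow[$2, i]
        }
    ' "$2" "$1"
}
```

Calling contained_pairs file_1 file_2 > pairs.tsv writes one line per exon/junction pair, in file_1 order. To put the file_2 columns first, swap the two arguments of print. Because the lookup is keyed on chromosome, identical coordinates on different chromosomes are not mixed up.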
