Find gene ids based on genome coordinates
0
0
Entering edit mode
4.5 years ago

I have two tab-delimited files, File1 contains the gene coordinates and gene ids whereas File2 contains genome coordinates obtained via the output of ASTALAVISTA.

Structure of File1:

1   3631    5899    +   AT1G01010
1   5928    8737    -   AT1G01020
1   11649   13714   -   AT1G01030
1   23146   31227   +   AT1G01040

Structure of File2:

1   4276    4605    +
1   4486    5095    +
1   4276    5095    +
1   23138   24451   +
1   24451   24655   +

I have extracted the first four columns of file1 using the command (cut -f-4 File1 > File1_new) so that the structure of both files looks same:

1   3631    5899    +
1   5928    8737    -
1   11649   13714   -
1   23146   31227   +

Now I am using bedtools intersect to find File2 coordinates that are present inside the coordinates of File1 with gene ids from File1 using commands:

grep -w -f <(bedtools intersect -a File1_new -b File2 -wa | cut -f1,2,3,4) File1 > file_1.intersect

output:

1   3631    5899    +   AT1G01010
1   5928    8737    -   AT1G01020
1   11649   13714   -   AT1G01030
1   23146   31227   +   AT1G01040

grep -w -f <(bedtools intersect -a File1_new -b File2) File2 > file_2.intersect

output:

1   4276    4605    +
1   4486    5095    +
1   4276    5095    +
1   24451   24655   +

As a result, I am getting the coordinates (as you can see the coordinate ( 1 23138 24451 +) was not present within the range of File1 so it got skipped) but missing the gene ids, is it possible to get the gene ids also so that output looks like:

1   4276    4605    +   AT1G01010
1   4486    5095    +   AT1G01010
1   4276    5095    +   AT1G01010
1   24451   24655   +   AT1G01040

Secondly is it also possible to get the gene ids and distance of the closest gene for coordinates ( 1 23138 24451 +) that got skipped to get output like:

1   4276    4605    +   AT1G01010   within
1   4486    5095    +   AT1G01010   within
1   4276    5095    +   AT1G01010   within
1   23138   24451   +   AT1G01040   -8bp
1   24451   24655   +   AT1G01040   within

Any help will be highly appreciated.

RNA-Seq gene • 838 views
ADD COMMENT
0
Entering edit mode

Getting the genes within those ranges would be reasonably easy.

Getting the nearest gene and outputting, say, -8bp away, would be tricker but not impossible. When you say it's x bp away though, how would you measure this? From the start of the gene? From the nearest bp within the CDS?

What if your coordinates are within a gene but don't fully cover it? What if the coordinates half cover a gene?

ADD REPLY
0
Entering edit mode

Many thanks for your response, I have checked the reason for ids that don't match. I have estimated the AS events from assemblies generated by stringtie. The coordinates in stringtie vary from TAIR10 coordinates, so I need to extract the gene coordinates from stringtie assembly and then have to assign gene ids based on those coordinates.

ADD REPLY
0
Entering edit mode

Then please answer my questions...

ADD REPLY

Login before adding your answer.

Traffic: 2140 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6