Best way to generate a bed file
2
0
Entering edit mode
6.7 years ago
elisheva ▴ 120

Hello everyone!

I have to do something and I kind of lost. I have 2 tab delimited text files which contains exons coordinates.

The first file contains the start coordinates (for example):

NM_032291   chr1    +   66999638    67091529    67098752    67101626
NM_001308203    chr1    +   66999251    66999928    67091529    67098752    67105459    67108492

and the second file - contains the end coordinates:

NM_032291   chr1    +   67000051    67091593    67098777    67101698
NM_001308203    chr1    +   66999355    67000051    67091593    67098777    67105516    67108547

I'd like to have multiple bed files for each exon(for example for the first exon):

 chr1    66999638     67000051     NM_032291   length    +
 chr1    66999251     66999355     NM_001308203    length    +

Each gene contains different number of exons - so the number of the columns is unknown. I believe there is a very simple way to do it, I've tried awk but without success.

Thanks!

bed • 1.6k views
ADD COMMENT
2
Entering edit mode
6.7 years ago
from itertools import izip

file1="file1.txt"
file2="file2.txt"

with open(file1) as f1, open(file2) as f2:
    for start_cords,end_cords in izip(f1,f2):
        start_cor = start_cords.strip().split("\t")
        end_cor = end_cords.strip().split("\t")
        for i in range(3,len(start_cor)):
            chr = start_cor[1]
            start = start_cor[i]
            end = end_cor[i]
            name = start_cor[0]
            length = int(end_cor[i])-int(start_cor[i])
            strand = start_cor[2]
            print '{0}\t{1}\t{2}\t{3}\t{4}\t{5}'.format(chr,start,end,name,length,strand)

Output:

chr1    66999638    67000051    NM_032291   413 +
chr1    67091529    67091593    NM_032291   64  +
chr1    67098752    67098777    NM_032291   25  +
chr1    67101626    67101698    NM_032291   72  +
chr1    66999251    66999355    NM_001308203    104 +
chr1    66999928    67000051    NM_001308203    123 +
chr1    67091529    67091593    NM_001308203    64  +
chr1    67098752    67098777    NM_001308203    25  +
chr1    67105459    67105516    NM_001308203    57  +
chr1    67108492    67108547    NM_001308203    55  +

Note: Its a python script.

ADD COMMENT
1
Entering edit mode
6.7 years ago
venu 7.1k

I don't understand why start coordinate is in one file and the end is in another. Albeit, you haven't provided what you've tried (which is not a best practice :) ), you can try the following

join <(sort -k1 start_file.txt) <(sort -k1 end_file.txt) | tr ' ' '\t' | awk -F'\t' '{print $1"\t"$2"\t"$4"\t"$10"\t"$10-$4"\t"$9}'

Explanation:

  • Join all columns in 2 files if first column in 2 files matches
  • Print required columns i.e. start and end of the NM_**

result

NM_032291       chr1    66999638        67000051        413     +
ADD COMMENT
0
Entering edit mode

Thanks! About your question: that's my data.... But I have one more problem - I just gave an example - the number of exons for each gene is different.

ADD REPLY
0
Entering edit mode

the number of exons for each gene is different.

You mean you have same ids multiple times in first column in a same file?

ADD REPLY
0
Entering edit mode

I edited my post, so you can see now.... And no, I have one Id per line and then all the exons coordinates

ADD REPLY
0
Entering edit mode

then the solution I provided should work fine. Did you get any error?

ADD REPLY
0
Entering edit mode

Maybe I didn't understand your answer... what does it mean $10? Isn't it simply column 10?

ADD REPLY
0
Entering edit mode

Yes, but after combining two files into one. I added explanation.

ADD REPLY

Login before adding your answer.

Traffic: 1609 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6