Text file: combining lines together based on value within certain column
4.9 years ago

I posted this on StackOverflow and the guys over there recommended I post it here.

I have a 9-column .BED file containing the yeast genome, with coordinates for all ORFs and some of the 5' and 3' UTRs.

I'm looking to combine rows within it based on the same gene name in column 9.

In the example below: rows 3, 4, 5 all have YAR014C in column 9 [3' UTR, gene, 5' UTR respectively]

I then want to replace the values in columns 4 and 5 (start and end coordinates) with the column 4 value from the original '3UTR' line and the column 5 value from the original '5UTR' line.

i.e. condensing the entire gene into one row, from 5' UTR start to 3' UTR end.

Not every gene in the file follows the 3UTR, gene, 5UTR pattern in column 9, so the merge would have to be based on the specific value in column 9 rather than on row number.

Here's a portion of the file:

I   martin  exon    160597  164187  .   -   .   gene_id "YAR009C_ORF";
I   martin  exon    164544  165866  .   -   .   gene_id "YAR010C_ORF";
I   martin  exon    166574  166741  .   -   .   gene_id "YAR014C_3UTR";
I   martin  exon    166742  168871  .   -   .   gene_id "YAR014C_ORF";
I   martin  exon    168872  169022  .   -   .   gene_id "YAR014C_5UTR";
I   martin  exon    170352  170395  .   -   .   gene_id "YAR018C_3UTR";
I   martin  exon    170396  171703  .   -   .   gene_id "YAR018C_ORF";
I   martin  exon    171704  171743  .   -   .   gene_id "YAR018C_5UTR";
I   martin  exon    172136  172210  .   -   .   gene_id "YAR019C_3UTR";
I   martin  exon    172211  175135  .   -   .   gene_id "YAR019C_ORF";
I   martin  exon    176856  177023  .   -   .   gene_id "YAR020C_ORF";
I   martin  exon    179241  179280  .   -   .   gene_id "YAR023C_3UTR";
I   martin  exon    179281  179820  .   -   .   gene_id "YAR023C_ORF";
I   martin  exon    179821  180087  .   -   .   gene_id "YAR023C_5UTR";
I   martin  exon    186512  186853  .   -   .   gene_id "YAR030C_ORF";

So the result I'd like for rows 3,4,5 would be:

I     martin      exon       166574       169022      .      -      .       gene_id "YAR014C";

Thank you for taking the time to look at this!

genome sequence • 1.2k views
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.


This worked - thank you. Fixed the problem.


Nice, please close this post.


Check bedtools merge and its options -c and -o. From there on it is only a matter of string splitting to get the correct $9. Can be done with awk. Try it out, best way to learn ;-)
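
The string-splitting step mentioned above could look something like the following in Python. This is only a sketch: the helper name `base_gene_name` is made up for illustration, and it assumes column 9 always looks like the sample rows in the question (`gene_id "NAME_SUFFIX";` with a suffix of `ORF`, `3UTR` or `5UTR`).

```python
import re

# Assumed attribute layout, taken from the sample rows: gene_id "NAME_SUFFIX";
GENE_ID = re.compile(r'gene_id "([^"]+)"')

# Suffixes seen in the sample rows; extend if the real file has others.
KNOWN_SUFFIXES = {"ORF", "3UTR", "5UTR"}


def base_gene_name(attribute: str) -> str:
    """Return the gene name from column 9 without its _ORF/_3UTR/_5UTR suffix."""
    match = GENE_ID.search(attribute)
    if match is None:
        raise ValueError(f"no gene_id found in: {attribute!r}")
    full_name = match.group(1)
    # rpartition splits on the last underscore only.
    name, sep, suffix = full_name.rpartition("_")
    if sep and suffix in KNOWN_SUFFIXES:
        return name
    # No recognised suffix: leave the name untouched.
    return full_name
```

With the sample data, `base_gene_name('gene_id "YAR014C_3UTR";')` gives `YAR014C`, which can then be used as the grouping key.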


What have you tried before posting at SO and here? Can you paste any code that you'd written to attempt to do this?

I'd approach this by storing the coordinates for each gene and outputting the min and max per gene.
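
A minimal Python sketch of that min/max idea follows. It is not code from the thread: the function name `merge_by_gene` is hypothetical, and it assumes whitespace-separated columns laid out like the sample rows, a `gene_id "NAME_SUFFIX";` attribute in column 9, and that a given gene name never occurs on more than one chromosome or strand.

```python
import re
from typing import Dict, Iterable, List

# Assumed column 9 layout, based on the sample rows: gene_id "NAME_SUFFIX";
GENE_ID = re.compile(r'gene_id "([^"]+)"')
SUFFIX = re.compile(r"_(?:ORF|3UTR|5UTR)$")


def merge_by_gene(lines: Iterable[str]) -> List[str]:
    """Collapse rows sharing a gene name into one row spanning min start to max end."""
    merged: Dict[str, List[str]] = {}  # dicts keep insertion order, so file order is preserved
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Split on any whitespace, but keep column 9 (which contains a space) whole.
        fields = line.split(None, 8)
        if len(fields) < 9:
            raise ValueError(f"expected 9 columns: {line!r}")
        match = GENE_ID.search(fields[8])
        if match is None:
            raise ValueError(f"no gene_id in column 9: {line!r}")
        gene = SUFFIX.sub("", match.group(1))
        start, end = int(fields[3]), int(fields[4])
        if gene not in merged:
            fields[8] = f'gene_id "{gene}";'
            merged[gene] = fields
        else:
            kept = merged[gene]
            kept[3] = str(min(int(kept[3]), start))
            kept[4] = str(max(int(kept[4]), end))
    return ["\t".join(fields) for fields in merged.values()]


if __name__ == "__main__":
    sample = [
        'I   martin  exon    166574  166741  .   -   .   gene_id "YAR014C_3UTR";',
        'I   martin  exon    166742  168871  .   -   .   gene_id "YAR014C_ORF";',
        'I   martin  exon    168872  169022  .   -   .   gene_id "YAR014C_5UTR";',
    ]
    for row in merge_by_gene(sample):
        print(row)
```

Because only the smallest start and largest end are kept, the order of the rows and the presence or absence of UTR lines for a gene should not matter. Output columns are tab-separated.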


Eric Lim, does this apply to genes on both the + and - strands?


I don't think it'd make any difference within the context of what the OP asked.

