I posted this on StackOverflow and the guys over there recommended I post it here.
I have a 9 column .BED file containing the yeast genome, with coordinates for all ORFs, and some of the 5' and 3' UTRs.
I'm looking to combine rows within it based on the same gene name in column 9.
In the example below: rows 3, 4, 5 all have YAR014C in column 9 [3' UTR, gene, 5' UTR respectively]
I'd then like to set columns 4 and 5 (start and end coordinates) to the column 4 value of the original line with '3UTR' in it and the column 5 value of the original line with '5UTR' in it,
i.e. condensing the entire gene into one row, from 5' UTR start to 3' UTR end.
Not all of the file follows the 3UTR, gene, 5UTR pattern in column 9, so the merge would have to be based on the specific value in column 9 rather than on row number.
Here's a portion of the file:
I martin exon 160597 164187 . - . gene_id "YAR009C_ORF";
I martin exon 164544 165866 . - . gene_id "YAR010C_ORF";
I martin exon 166574 166741 . - . gene_id "YAR014C_3UTR";
I martin exon 166742 168871 . - . gene_id "YAR014C_ORF";
I martin exon 168872 169022 . - . gene_id "YAR014C_5UTR";
I martin exon 170352 170395 . - . gene_id "YAR018C_3UTR";
I martin exon 170396 171703 . - . gene_id "YAR018C_ORF";
I martin exon 171704 171743 . - . gene_id "YAR018C_5UTR";
I martin exon 172136 172210 . - . gene_id "YAR019C_3UTR";
I martin exon 172211 175135 . - . gene_id "YAR019C_ORF";
I martin exon 176856 177023 . - . gene_id "YAR020C_ORF";
I martin exon 179241 179280 . - . gene_id "YAR023C_3UTR";
I martin exon 179281 179820 . - . gene_id "YAR023C_ORF";
I martin exon 179821 180087 . - . gene_id "YAR023C_5UTR";
I martin exon 186512 186853 . - . gene_id "YAR030C_ORF";
So the result I'd like for rows 3,4,5 would be:
I martin exon 166574 169022 . - . gene_id "YAR014C";
Thank you for taking the time to look at this!
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.
duplicate: Merging position of all different CDS of a single gene in one line
This worked - thank you. Fixed the problem.
nice, close this post please.
Check bedtools merge and its options -c and -o. From there on it is only a matter of string splitting to get the correct $9. Can be done with awk. Try it out, best way to learn ;-)
What have you tried before posting at SO and here? Can you paste any code that you'd written to attempt to do this?
I'd approach this by storing the coordinates relative to each gene and outputting the min and max per gene.
Eric Lim, does it apply to genes on both the + and - strands?
I don't think it'd make any difference within the context of what the OP asked.
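For anyone landing here later, the min/max-per-gene idea can be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions that are not confirmed by the original post: the function name merge_by_gene is made up for illustration, column 9 always looks like gene_id "NAME_SUFFIX"; with the suffix being one of _ORF, _5UTR or _3UTR, the input is whitespace-delimited, and the output is written tab-delimited. Because start is never greater than end on either strand, taking the smallest start and largest end gives the full span for + and - genes alike.

```python
import re

# Assumed layout of column 9: gene_id "YAR014C_3UTR";
GENE_ID = re.compile(r'gene_id "([^"]+)"')
# Assumed suffixes; anything else is left as part of the gene name.
SUFFIX = re.compile(r"_(?:ORF|5UTR|3UTR)$")


def merge_by_gene(lines):
    """Collapse rows sharing a gene name into one row spanning
    the smallest start (column 4) to the largest end (column 5)."""
    merged = {}  # gene name -> list of 9 output fields, in first-seen order
    for line in lines:
        fields = line.rstrip("\n").split(None, 8)  # column 9 contains a space
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        match = GENE_ID.search(fields[8])
        if match is None:
            continue  # skip rows without a gene_id attribute
        gene = SUFFIX.sub("", match.group(1))
        start, end = int(fields[3]), int(fields[4])
        if gene in merged:
            record = merged[gene]
            record[3] = min(record[3], start)
            record[4] = max(record[4], end)
        else:
            merged[gene] = (
                fields[:3] + [start, end] + fields[5:8] + [f'gene_id "{gene}";']
            )
    return ["\t".join(str(f) for f in record) for record in merged.values()]


if __name__ == "__main__":
    sample = [
        'I martin exon 164544 165866 . - . gene_id "YAR010C_ORF";',
        'I martin exon 166574 166741 . - . gene_id "YAR014C_3UTR";',
        'I martin exon 166742 168871 . - . gene_id "YAR014C_ORF";',
        'I martin exon 168872 169022 . - . gene_id "YAR014C_5UTR";',
    ]
    for row in merge_by_gene(sample):
        print(row)
```

Genes that only have an ORF row come out with their original coordinates, so the row order and the presence or absence of UTR rows should not matter. The sketch keys on the gene name alone; if the same name could appear on more than one chromosome, the key would need to include column 1 as well.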