Question

Awk, Sed, Grep to create a sequential numerical text replacement in my gtf file

5.3 years ago

sdbaney ▴ 10

Hi all, I'm new to awk, sed, and grep but I'm thinking they can help me with something.

I have a self-made gtf file that I'm using as a mask file for differential expression analyses (transcripts from my de novo assembly that contain rRNA contamination), but I need to change the final column. Every entry in it must be unique and follow a particular format.

For example, right now I have this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012   1000   +    *    gene_id exclude.1
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224   1000   +    *    gene_id exclude.2
TRINITY_DN27462_c57_g3   CEH   transcript   1    539    1000   +    *    gene_id exclude.3
TRINITY_DN27098_c57_g4   CEH   transcript   1    350    1000   +    *    gene_id  exclude.4

but I need it to be this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012    1000   +    *    gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224    1000   +    *    gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3   CEH   transcript   1     539    1000   +    *    gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4   CEH   transcript   1     350    1000   +    *    gene_id "exclude.4"; transcript_id "exclude.4.1"

So the final column needs this string of text, with the number increasing by 1 in two places on each row. I have 493 lines in my gtf file, so there has to be an easy way to do this. Can anyone give me some tips/pointers?

awk sed grep • 1.3k views

updated 5.3 years ago by Ram 43k • written 5.3 years ago by sdbaney ▴ 10

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

5.3 years ago by Ram 43k

Something like this:

$ awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8,$9" \""$10"\"; transcript_id \""$10".1\""}' test.txt
TRINITY_DN27462_c57_g1  CEH transcript  1   3012    1000    +   *   gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2  CEH transcript  1   1224    1000    +   *   gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3  CEH transcript  1   539 1000    +   *   gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4  CEH transcript  1   350 1000    +   *   gene_id "exclude.4"; transcript_id "exclude.4.1"
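
If the IDs simply follow the line order, as they do in the example, a variant could build them from awk's record number (NR) instead of reusing field 10. This is only a sketch under that assumption: the function name renumber and the two input lines are made up for illustration, and it assumes whitespace-separated input with the attributes starting at field 9.

```shell
# renumber: hypothetical helper that rewrites the attribute column
# using the record number (NR) as the sequential ID.
renumber() {
  awk -v OFS="\t" '{
    id = "exclude." NR    # exclude.1, exclude.2, ... in line order
    print $1,$2,$3,$4,$5,$6,$7,$8, "gene_id \"" id "\"; transcript_id \"" id ".1\""
  }' "$@"
}

# Two made-up input lines fed on stdin; the old IDs (foo, bar) are discarded.
out=$(printf 'A\tCEH\ttranscript\t1\t3012\t1000\t+\t*\tgene_id foo\nB\tCEH\ttranscript\t1\t1224\t1000\t+\t*\tgene_id bar\n' | renumber)
printf '%s\n' "$out"
```

With a real file it would be called as renumber test.txt > masked.gtf (the output name is arbitrary). Note that NR-based numbering ignores whatever IDs are already in the file, so it only fits when the existing numbers carry no meaning.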

5.3 years ago by cpad0112 21k

score 3 · Answer 1 · 2018-12-27

5.3 years ago

Ram 43k

sed -r 's/gene_id([ ]+)exclude[.]([0-9]+)/gene_id\1"exclude.\2"; transcript_id\1"exclude.\2.1"/'

If you only want a hint: Capture the number after exclude and use it with a backreference, appending .1 where required.
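
To make the hint concrete, here is a small sketch run on a single made-up line. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the add_ids function name and the sample line are only for illustration. The literal dot has to be written out in the replacement, since the capture group holds just the digits.

```shell
# Hypothetical sample line; fields are tab-separated as in a GTF file.
line=$(printf 'TRINITY_DN27462_c57_g1\tCEH\ttranscript\t1\t3012\t1000\t+\t*\tgene_id exclude.1')

# \1 = the space(s) after gene_id, \2 = the number after "exclude."
add_ids() {
  sed -E 's/gene_id([ ]+)exclude[.]([0-9]+)/gene_id\1"exclude.\2"; transcript_id\1"exclude.\2.1"/'
}

result=$(printf '%s\n' "$line" | add_ids)
printf '%s\n' "$result"
```

Because \1 is reused after transcript_id, a line with two spaces after gene_id will also get two spaces after transcript_id.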
