Awk, Sed, Grep to create a sequential numerical text replacement in my gtf file
5.3 years ago
sdbaney ▴ 10

Hi all, I'm new to awk, sed, and grep but I'm thinking they can help me with something.

I have a self-made GTF file that I'm using as a mask file for differential expression analyses (it lists transcripts from my de novo assembly that contain rRNA contamination), but I need to change the final column. The IDs must all be unique and in a particular format.

For example, right now I have this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012   1000   +    *    gene_id exclude.1
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224   1000   +    *    gene_id exclude.2
TRINITY_DN27462_c57_g3   CEH   transcript   1    539    1000   +    *    gene_id exclude.3
TRINITY_DN27098_c57_g4   CEH   transcript   1    350    1000   +    *    gene_id  exclude.4

but I need it to be this:

TRINITY_DN27462_c57_g1   CEH   transcript   1   3012    1000   +    *    gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2   CEH   transcript   1   1224    1000   +    *    gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3   CEH   transcript   1     539    1000   +    *    gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4   CEH   transcript   1     350    1000   +    *    gene_id "exclude.4"; transcript_id "exclude.4.1"

So the final column needs this string of text, with the number increasing by 1 in two places on each row. I have 493 lines in my GTF file, so there has to be an easy way to do this. Can anyone give me some tips/pointers?

awk sed grep • 1.3k views

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

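Something like this (fields 9 and 10 are joined with a space rather than a tab, so the attributes stay in a single tab-delimited column):

$ awk -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8,$9" \""$10"\"; transcript_id \""$10".1\""}' test.txt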
TRINITY_DN27462_c57_g1  CEH transcript  1   3012    1000    +   *   gene_id "exclude.1"; transcript_id "exclude.1.1"
TRINITY_DN27462_c57_g2  CEH transcript  1   1224    1000    +   *   gene_id "exclude.2"; transcript_id "exclude.2.1"
TRINITY_DN27462_c57_g3  CEH transcript  1   539 1000    +   *   gene_id "exclude.3"; transcript_id "exclude.3.1"
TRINITY_DN27098_c57_g4  CEH transcript  1   350 1000    +   *   gene_id "exclude.4"; transcript_id "exclude.4.1"
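
If the existing numbers can't be trusted to be unique, a variant of the same idea is to ignore the old ID and number the rows with awk's built-in NR (the current record number). This is only a minimal sketch: it assumes whitespace-separated input with gene_id in field 9 and no blank or comment lines, and the function name renumber_gtf is made up for illustration.

```shell
# Hypothetical helper: rebuild the attribute column from the line number.
# Assumes fields 1-8 are the usual GTF columns and field 9 is "gene_id".
renumber_gtf() {
    awk -v OFS="\t" '{
        id = "exclude." NR      # NR counts input lines: 1, 2, 3, ...
        attr = $9 " \"" id "\"; transcript_id \"" id ".1\""
        print $1, $2, $3, $4, $5, $6, $7, $8, attr
    }' "$@"
}
```

It could be called as `renumber_gtf test.txt > fixed.gtf`; with no file argument it reads standard input.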
5.3 years ago
Ram 43k
sed -r 's/gene_id([ ]+)exclude[.]([0-9]+)/gene_id\1"exclude.\2"; transcript_id\1"exclude.\2.1"/'

If you only want a hint: Capture the number after exclude and use it with a backreference, appending .1 where required.
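
To see the hint working on a single line, here is a small sketch. It uses -E, which GNU sed treats like -r and which BSD/macOS sed also accepts, and it matches [[:space:]]+ on the assumption that the separator might be a tab rather than spaces. The whole ID is captured, so the dot is kept in the output. The function name quote_gtf_ids is hypothetical.

```shell
# Hypothetical wrapper around the substitution, for illustration only.
# \1 is the ID captured after "gene_id", e.g. exclude.4
quote_gtf_ids() {
    sed -E 's/gene_id[[:space:]]+(exclude[.][0-9]+)/gene_id "\1"; transcript_id "\1.1"/' "$@"
}

echo 'gene_id  exclude.4' | quote_gtf_ids
# prints: gene_id "exclude.4"; transcript_id "exclude.4.1"
```

Lines that don't contain a matching gene_id are passed through unchanged, so it is worth checking the output line count and spot-checking a few rows.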
