Remove Extra Info From Fasta Header
3.2 years ago
kbeavers97 ▴ 10

Hi all,

I'm trying to remove extra info from the headers in a fasta file. For example, one of the headers looks like this:

>TRINITY_DN2114_c0_g1_i25 len=23510 path=[5:0-984 11:985-5200 13:5201-5226 14:5227-7391 16:7392-11682 18:11683-12445 20:12446-17359 21:17360-17390 23:17391-19171 24:19172-19243 25:19244-21000 26:21001-21804 27:21805-22955 32:22956-23509]

And I want it to look like this:

>TRINITY_DN2114_c0_g1_i25

There are no quotation marks in the header, but I had to include them to include the ">" sign.

I've been trying to do this with sed, but I can't get it to work! Any help is greatly appreciated.

fasta RNA-Seq sed awk linux • 2.9k views

You had to include the quotes because the formatting option you were using was the blockquote formatting option, which is not the right one here. You need to use the code formatting option (the 101010 button). It can be challenging to figure out the right option without some trial and error. I've fixed it for you now.

3.2 years ago

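You can probably get away with just using cut, since you only want to get rid of everything after the first space in the header: cut -d" " -f1 file.fasta > newfile.fasta. Note that cut trims every line at its first space, not just the headers, so this assumes the sequence lines contain no spaces.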

You can also use seqkit replace. Example 1 for the replace function is similar to what you want to do.
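
Since the question asked about sed specifically, here is a minimal sketch of one way to do it. It restricts the substitution to header lines (those starting with ">") and deletes everything from the first whitespace character onward. The file names example.fa and example.trimmed.fa are made up for illustration, and the shortened header is a stand-in for the one in the question.

```shell
# Build a tiny example file (hypothetical name: example.fa).
printf '%s\n' \
  '>TRINITY_DN2114_c0_g1_i25 len=23510 path=[5:0-984 11:985-5200]' \
  'ACGTACGTAC' \
  '>seq2 len=8' \
  'GGCCTTAA' > example.fa

# On header lines only (address /^>/), delete everything from the
# first space or tab to the end of the line. Sequence lines are untouched.
sed '/^>/ s/[[:space:]].*//' example.fa > example.trimmed.fa

cat example.trimmed.fa
```

Writing to a new file rather than editing in place keeps the original around in case the pattern does not behave as expected on your data.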

3.2 years ago
GenoMax 141k

You can use reformat.sh from the BBMap suite to remove everything after the first space in the header.

$ reformat.sh in=test.fa out=trimmed.fa trd=t
3.2 years ago
$ awk '{print $1}' seq.fa

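This assumes that the sequence lines do not contain any spaces, because awk prints only the first whitespace-delimited field of every line, not just the headers.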
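
If you would rather not rely on that assumption, a small variation is to apply the field trimming only to header lines and print everything else unchanged. This is just a sketch; the file names spaced.fa and spaced.trimmed.fa are hypothetical, and the sequence line deliberately contains a space to show that it is left alone.

```shell
# Build a tiny example file whose sequence line contains a space
# (hypothetical name: spaced.fa).
printf '%s\n' \
  '>seq1 len=8 extra=info' \
  'ACGT ACGT' > spaced.fa

# For header lines, print only the first field and skip to the next line;
# for all other lines, print the line as it is.
awk '/^>/ {print $1; next} {print}' spaced.fa > spaced.trimmed.fa

cat spaced.trimmed.fa
```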
