Question

Remove Extra Info From Fasta Header

3.2 years ago

kbeavers97 ▴ 10

Hi all,

I'm trying to remove extra info from the headers in a fasta file. For example, one of the headers looks like this:

>TRINITY_DN2114_c0_g1_i25 len=23510 path=[5:0-984 11:985-5200 13:5201-5226 14:5227-7391 16:7392-11682 18:11683-12445 20:12446-17359 21:17360-17390 23:17391-19171 24:19172-19243 25:19244-21000 26:21001-21804 27:21805-22955 32:22956-23509]

And I want it to look like this:

>TRINITY_DN2114_c0_g1_i25

There are no quotation marks in the header, but I had to include them to include the ">" sign.

I've been trying to do this with sed, but I can't get it to work! Any help is greatly appreciated.

fasta RNA-Seq sed awk linux • 2.9k views

You had to include the quotes because the formatting option you were using was the blockquote formatting option, which is not the right one here. You need to use the code formatting option (the 101010 button). It can be challenging to figure out the right option without some trial and error. I've fixed it for you now.

ADD REPLY • link 3.2 years ago by Ram 43k

score 3 · Answer 1 · 2021-02-02

3.2 years ago

rpolicastro 13k

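You can probably get away with just using cut, since you just want to get rid of everything after the first space in the header: cut -d" " -f1 file.fasta > newfile.fasta. Note that cut applies to every line, not just the headers, so this assumes the sequence lines contain no spaces.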

You can also use seqkit replace. Example 1 for the replace function is similar to what you want to do.
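Since the question mentions sed, here is a minimal sketch of the same idea that only touches header lines (those starting with ">") and leaves sequence lines alone. The file names example.fa and example.trimmed.fa and the shortened sample records are made up for illustration.

```shell
# Build a small example FASTA (hypothetical file name and records).
printf '>TRINITY_DN2114_c0_g1_i25 len=23510 path=[5:0-984 11:985-5200]\nACGT ACGT\n>seq2 len=4\nTTGA\n' > example.fa

# On header lines only, delete everything from the first space or tab
# to the end of the line. Sequence lines are passed through unchanged.
sed '/^>/ s/[[:space:]].*$//' example.fa > example.trimmed.fa

# Show the result.
cat example.trimmed.fa
```

Writing to a new file rather than editing in place keeps the original around in case the trimmed IDs turn out not to be what you wanted.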

score 1 · Answer 2 · 2021-02-02

3.2 years ago

GenoMax 141k

You can use reformat.sh from the BBMap suite to remove everything after the first space in the header.

$ reformat.sh in=test.fa out=trimmed.fa trd=t

score 1 · Answer 3 · 2021-02-03

3.2 years ago

cpad0112 21k

$ awk '{print $1}' seq.fa

This assumes the sequence lines do not contain any spaces, since awk prints only the first whitespace-separated field of every line.
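If that assumption might not hold, a small variation restricts the trimming to header lines. This is only a sketch: example2.fa, example2.trimmed.fa, and the sample records are hypothetical.

```shell
# Build a small example FASTA; the first sequence line deliberately contains a space.
printf '>seq1 len=8 extra\nACGT ACGT\n>seq2\tdesc\nTTGA\n' > example2.fa

# For header lines, print only the first field (awk splits on spaces and tabs
# by default); print every other line unchanged.
awk '/^>/ {print $1; next} {print}' example2.fa > example2.trimmed.fa

# Show the result.
cat example2.trimmed.fa
```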
