Question

Customize fasta headers

0

7.9 years ago

bkvijay.jayaraman • 0

Hi, I have a FASTA file with 300 protein sequences that I intend to use to construct a phylogenetic tree. I want to keep only the accession number and the organism name in each FASTA header and remove the rest of the information. Can anybody suggest how to do this? I have a Linux-based system with Perl and Python installed.

For example, I want to convert a header like this:

>gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]

to a header like this:

>Leishmania panamensis| AIN98665.1

Some sequences have multiple headers. Would that be a problem?

Regards, Vijay

sequence edit • 3.3k views

ADD COMMENT • link updated 7.9 years ago by moranr ▴ 290 • written 7.9 years ago by bkvijay.jayaraman • 0

4

Strictly speaking, yours is not a proper FASTA header. Anything following the first space/tab is not part of the sequence name, so renaming FASTA records like this may confuse other tools.
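To illustrate the point about whitespace, here is a minimal Python sketch of how a tool that follows this convention would read the headers from the question. The function name `sequence_name` and the underscored variant are hypothetical, added only for illustration; they are not something the posters proposed.

```python
def sequence_name(header_line: str) -> str:
    """Return the part of a FASTA header that is treated as the sequence name."""
    # Drop the leading ">" and keep everything up to the first space or tab.
    fields = header_line.lstrip(">").split(None, 1)
    return fields[0] if fields else ""


original = ">gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]"
requested = ">Leishmania panamensis| AIN98665.1"
# Hypothetical alternative without whitespace (an assumption, not from the thread).
underscored = ">Leishmania_panamensis|AIN98665.1"

print(sequence_name(original))     # gi|685204428|gb|AIN98665.1|
print(sequence_name(requested))    # Leishmania
print(sequence_name(underscored))  # Leishmania_panamensis|AIN98665.1
```

With the requested format, every sequence from the same genus would end up with the same name ("Leishmania"), which is the kind of confusion the comment warns about.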

ADD REPLY • link 7.9 years ago by lh3 33k

score 2 · Answer 1 · 2016-05-12

2

7.9 years ago

Ram 43k

Sequences cannot have multiple headers, AFAIK. Are you sure those are not just headers with empty sequences?

You can use a combination of bioawk and sed for this. The sed command would look like:

echo "$header" | sed -re 's/^>.*gb[|]([A-Z0-9.]+)[|].*[[]([A-Za-z ]+)[]]/>\2| \1/'
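Since the question mentions that Python is available, the same substitution can be sketched with the standard `re` module and applied to every header line in a file. This is an untested-on-real-data sketch under assumptions: the helper names `rewrite_header` and `rewrite_fasta` are made up here, the pattern only handles `gb|` accessions as in the example header, and headers that do not match are left unchanged.

```python
import re

# Capture the accession after "gb|" and the organism name inside the final [...].
HEADER_RE = re.compile(r"^>.*gb\|([A-Z0-9_.]+)\|.*\[([^\[\]]+)\]\s*$")


def rewrite_header(line: str) -> str:
    """Return '>organism| accession', or the unchanged header if it does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return line.rstrip("\n")
    accession, organism = match.groups()
    return f">{organism}| {accession}"


def rewrite_fasta(in_path: str, out_path: str) -> None:
    """Copy a FASTA file, rewriting header lines and leaving sequence lines as-is."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(rewrite_header(line) + "\n")
            else:
                dst.write(line)
```

For the header in the question, `rewrite_header` returns `>Leishmania panamensis| AIN98665.1`.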

ADD COMMENT • link 7.9 years ago by Ram 43k

1

The "block quotes" in the post editor used by OP were messing up the display. The headers are regular FASTA headers (see above).

ADD REPLY • link 7.9 years ago by GenoMax 141k

0

I ignored that part - my regex includes the > sign in the header. It's OP's statement about "multiple headers" that has me curious.

ADD REPLY • link 7.9 years ago by Ram 43k

score 0 · Answer 2 · 2016-05-12

Hi,

You could use Biopython, something like this (I've done no testing):

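from Bio import SeqIO

input_file = "input.fa"  # placeholder name: replace with the path to your FASTA file
output_file = "editedHeaders.fa"

myRecords = []
for fasta in SeqIO.parse(input_file, "fasta"):
    # description holds the full header line without the leading ">"
    newIDList = fasta.description.split("[")
    # the organism name is the text inside the last [...] pair
    species = newIDList[-1].replace("]", "").strip()
    # fields are gi|<number>|gb|<accession>|..., so the accession is the fourth field
    accession = newIDList[0].split("|")[3]
    fasta.id = species + "|" + accession
    fasta.description = ""  # keep only the new ID in the output header
    myRecords.append(fasta)

SeqIO.write(myRecords, output_file, "fasta")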

I would also add a dict to save the original headers and pickle it so you can look up if needed.
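A minimal sketch of that lookup idea, using only the standard library. The helper names (`new_header`, `build_header_map`, `save_header_map`, `load_header_map`) are hypothetical, and the header parsing assumes the same gi|...|gb|ACCESSION| ... [organism] layout as the example in the question.

```python
import pickle


def new_header(original: str) -> str:
    """Build 'organism|accession' from a gi|...|gb|ACCESSION| ... [organism] header."""
    text = original.lstrip(">").strip()
    organism = text.split("[")[-1].replace("]", "").strip()
    accession = text.split("|")[3]
    return organism + "|" + accession


def build_header_map(headers):
    """Map each shortened header to the original header it came from."""
    return {new_header(h): h for h in headers}


def save_header_map(header_map, path):
    """Pickle the mapping so it can be reloaded later."""
    with open(path, "wb") as handle:
        pickle.dump(header_map, handle)


def load_header_map(path):
    """Load a mapping previously written by save_header_map."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

After renaming, `load_header_map(path)["Leishmania panamensis|AIN98665.1"]` would give back the full original header. A plain tab-separated text file would work just as well if you want the mapping to be readable outside Python.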

Good luck