Customize fasta headers
7.9 years ago

Hi, I have a FASTA file with 300 protein sequences that I intend to use to construct a phylogenetic tree. I want to keep only the accession number and the organism name in each FASTA header and remove the rest of the information. Can anybody suggest how to do this? I have a Linux-based system with Perl and Python installed.

For example, I want to convert a header like this:

>gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]

to a header like this:

>Leishmania panamensis| AIN98665.1

Some sequences have multiple headers. Would that be a problem?

Regards, Vijay


Strictly speaking, the header you want is not proper FASTA. Anything following the first space or tab is not part of the sequence name, so renaming FASTA headers like this may confuse other tools.
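To illustrate the point about whitespace, here is a minimal Python sketch of how a tool might split a header line into a name and a description. The function name `split_header` is made up for this illustration; real tools may differ in details.

```python
def split_header(header_line):
    """Split a FASTA header into (name, description) at the first whitespace."""
    text = header_line.lstrip(">").strip()
    parts = text.split(None, 1)  # split on the first run of spaces/tabs
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


name, description = split_header(">Leishmania panamensis| AIN98665.1")
print(name)         # Leishmania
print(description)  # panamensis| AIN98665.1
```

With the requested header, a tool that behaves this way would see only "Leishmania" as the sequence name, so several sequences from the same genus could end up with the same name.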

7.9 years ago
Ram 43k

Sequences cannot have multiple headers, AFAIK. Are you sure those are not just headers with empty sequences?
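One way to check for headers with empty sequences is a short Python sketch like the one below. The function name `headers_without_sequence` and the two example records are made up for illustration; in practice you would pass an open file handle instead of a list.

```python
def headers_without_sequence(lines):
    """Return header lines that are not followed by any sequence line."""
    empty = []
    current = None        # header of the record currently being read
    has_sequence = False  # whether that record has at least one sequence line
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if current is not None and not has_sequence:
                empty.append(current)
            current = line
            has_sequence = False
        else:
            has_sequence = True
    # The last record in the file needs the same check
    if current is not None and not has_sequence:
        empty.append(current)
    return empty


# Made-up records: the first header has no sequence below it
example = [
    ">gi|1|gb|AAA00001.1| protein one [Species one]",
    ">gi|2|gb|AAA00002.1| protein two [Species two]",
    "MKVLAAGIVG",
]
print(headers_without_sequence(example))
```

If this prints any headers for your file, those are the records that look like "multiple headers" for a single sequence.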

You can use a combination of bioawk and sed to do this. The sed command would look something like:

echo "$header" | sed -re 's/^>.*gb[|]([A-Z0-9.]+)[|].*[[]([A-Za-z ]+)[]]/>\2| \1/'
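Since OP also has Python installed, the same substitution can be sketched with the standard `re` module. This is only a sketch: the name `rename_header` is made up, and the pattern assumes GenBank-style `gb|accession|` headers that end in `[Organism name]`, so headers from other databases (for example `ref|` or `emb|`) are returned unchanged.

```python
import re

# gb|<accession>| ... [<organism>] ; the accession character set is an assumption
HEADER_RE = re.compile(r"^>.*gb\|([A-Z0-9_.]+)\|.*\[([^\]]+)\]\s*$")


def rename_header(header):
    """Return '>Organism| accession', or the header unchanged if it does not match."""
    header = header.strip()
    match = HEADER_RE.match(header)
    if match is None:
        return header
    accession, organism = match.groups()
    return ">{}| {}".format(organism, accession)


original = ">gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]"
print(rename_header(original))  # >Leishmania panamensis| AIN98665.1
```

Leaving non-matching headers untouched makes it easy to spot afterwards which records still need attention.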

The block quotes OP used in the post editor were messing up the display. The headers are regular FASTA headers (see above).


I ignored that part - my regex includes the > sign in the header. It's OP's statement about "multiple headers" that has me curious.

7.9 years ago
moranr ▴ 290

Hi,

You could use Biopython, something like this (I've done no testing):

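from Bio import SeqIO

input_file = "sequences.fa"  # placeholder: path to your input FASTA file
output_file = "editedHeaders.fa"

myRecords = []
for fasta in SeqIO.parse(input_file, "fasta"):
    # fasta.description is the full header line without the leading '>'
    newIDList = fasta.description.split("[")
    # the organism name is inside the last pair of square brackets
    species = newIDList[-1].replace("]", "").strip()
    # fields are gi|number|gb|accession|, so the accession is the 4th field
    newID = species + "|" + newIDList[0].split("|")[3]
    fasta.id = newID
    fasta.description = ""  # drop the rest of the header
    myRecords.append(fasta)

SeqIO.write(myRecords, output_file, "fasta")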

I would also add a dict to save the original headers and pickle it, so you can look them up later if needed.
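A rough sketch of that idea, using only the standard library. The file name `original_headers.pkl` is just a placeholder, and the single entry below is the example from the question; in real use the dict would be filled inside the renaming loop.

```python
import os
import pickle
import tempfile

# Map each new header to the original one
header_map = {
    "Leishmania panamensis|AIN98665.1":
        "gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]",
}

# Placeholder file name, written to the temp directory for this demo
map_path = os.path.join(tempfile.gettempdir(), "original_headers.pkl")

# Save the mapping to disk
with open(map_path, "wb") as handle:
    pickle.dump(header_map, handle)

# Later: load the mapping back and look up an original header
with open(map_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored["Leishmania panamensis|AIN98665.1"])
```

A plain tab-separated text file would work just as well if you want something readable outside Python.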

Good luck
