Modifying Fasta file header
7.1 years ago
Rose • 0

Hi, I still have a problem modifying my multi-FASTA file. I want to change headers like these:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence
>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence
>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence

to

>gi|51039021|pDN571
>gi|62945224|pCCK3259
>gi|63219713|pCCK381

Any help would be very useful. Thanks.

sequence • 2.4k views

Did you make two accounts (Blaise) to ask these similar questions? See this thread from today: Modifying Fasta file header

We are two researchers working in the same group, each with our own problem. There were two problems: one has been solved, and the second one still needs a solution.

I see. You are using the example data the other person had posted which led to my question.

There are multiple threads that deal with these types of header manipulations. Have you tried to search through them?

Yes, but without success.

If you show what you tried and what didn't work, people will be more eager to point out your mistake or help you further, because you demonstrate that you have put effort into this problem, too.

It might be helpful to know why you want to modify your headers in this fashion and what some of your other headers look like.

I would like to run BLAT with unique FASTA identifiers.
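
Since the stated goal is unique identifiers, it may be worth checking the rewritten headers for duplicates before running BLAT. The following is only a minimal sketch in plain Python: the function name `duplicate_ids` is hypothetical, the sequence lines are placeholders, and it assumes the identifier is the header text up to the first whitespace.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return identifiers that occur more than once among the FASTA headers.

    Assumption: the identifier is the header text up to the first whitespace.
    """
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    counts = Counter(ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Shortened headers from the question; the sequence lines are placeholders.
example = [
    ">gi|51039021|pDN571",
    "ACGT",
    ">gi|62945224|pCCK3259",
    "GGCC",
    ">gi|63219713|pCCK381",
    "TTAA",
]
print(duplicate_ids(example))  # an empty list means every identifier is unique
```

For a real file, passing an open file handle instead of the list should work the same way, since the function only iterates over lines.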

This should be reasonably straightforward using Biopython's SeqIO.

Won't Biopython shorten the headers automatically, making it impossible to parse the full header?

I'm not aware of that. What makes you think it does?

Most methods that access FASTA entries using the offsets stored in a *.fai file will truncate the header name at the first whitespace. However, Bio.SeqIO does not use this scheme. Both samtools and pyfaidx do, but pyfaidx offers a way around it: FastaRecord.long_name recovers the entire header name from the FASTA file instead of using the truncated version stored in the index file.

Thanks, Matt. I did a quick test using the OP's first example header, and it was truncated to >gi|51039021|ref|NC_006130.1|

Apparently Biopython uses the strict definition (if FASTA has any) of the ID as everything before the first space. See Can Biopython Properly Import Fasta Headers With Spaces In Them? To get the whole header, you want SeqRecord.description, not SeqRecord.id.
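
To make the distinction concrete, here is a small sketch in plain Python that mimics the convention described above. It does not use Biopython; `split_header` is a hypothetical helper that only illustrates how the identifier relates to the full header line.

```python
def split_header(header_line):
    """Split a FASTA header into (identifier, full_description).

    Mimics the convention discussed above: the identifier is everything
    before the first whitespace, and the description is the whole header
    line without the leading '>'.
    """
    text = header_line.rstrip("\n")
    if text.startswith(">"):
        text = text[1:]
    parts = text.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, text


header = (">gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 "
          "plasmid pDN571, complete sequence")
identifier, description = split_header(header)
print(identifier)   # the truncated form seen in the quick test above
print(description)  # the full header, including the plasmid name
```

The plasmid name only appears in the full description, which is why any rewrite that needs it has to start from the whole header line rather than the identifier.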

7.1 years ago

Assuming there is consistency in header formatting:

$ awk '{ if ($0 !~ /^>/) { print $0; } else { m = split($0, a, "|"); n = split(a[5], b, " "); sub(/,$/, "", b[5]); print a[1]"|"a[2]"|"b[5]; } }' in.fa > out.fa

Buyer beware: If anything changes in the header format, you may not get an expected result. No refunds!
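
If silent failures on odd headers are a concern, one option is to make the rewrite fail loudly instead. The sketch below is plain Python rather than awk, and the pattern is an assumption based only on the three example headers in the question (`gi|<number>|ref|<accession>|`, then free text containing `plasmid <name>,`). The names `HEADER_RE`, `shorten_header` and `rewrite` are hypothetical.

```python
import re

# Assumed layout, inferred only from the three example headers:
# >gi|<number>|ref|<accession>| <organism ...> plasmid <name>, complete sequence
HEADER_RE = re.compile(r"^>gi\|(\d+)\|ref\|[^|]+\|.*\bplasmid\s+([^\s,]+),")


def shorten_header(line):
    """Return '>gi|<number>|<plasmid name>' or raise if the layout differs."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unexpected header format: {line!r}")
    gi_number, plasmid = match.groups()
    return f">gi|{gi_number}|{plasmid}"


def rewrite(lines):
    """Yield lines with headers shortened and sequence lines unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield shorten_header(line) if line.startswith(">") else line


headers = [
    ">gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence",
    ">gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence",
    ">gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence",
]
for new_header in rewrite(headers):
    print(new_header)
```

A header that does not fit the assumed layout raises a `ValueError` rather than producing a truncated or empty name, which trades convenience for safety.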

7.1 years ago
GenoMax 141k

Try this version as well:

$ awk -F '[ |,]' '{ if ($0 ~ /^>/) { print $1"|"$2"|"$10;} else { print $0;}}' input.fa > output.fa