Question

Modifying Fasta file header

0

7.1 years ago

Rose • 0

Hi, I still have a problem modifying my multi-FASTA file. I want to change headers like these:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence
>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence
>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence

to

>gi|51039021|pDN571
>gi|62945224|pCCK3259
>gi|63219713|pCCK381

Your help would be very useful. Thanks.

sequence • 2.4k views

ADD COMMENT • link updated 7.1 years ago by GenoMax 141k • written 7.1 years ago by Rose • 0

1

Did you make two accounts (Blaise) to ask these similar questions? See this thread from today: Modifying Fasta file header

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

We are two researchers, each with our own problem, although we work in the same group. There were two problems: one has been solved, and we still need a solution for the second.

ADD REPLY • link 7.1 years ago by Rose • 0

0

I see. You are using the example data the other person had posted which led to my question.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

There are multiple threads that deal with these types of header manipulations. Have you tried to search through them?

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Yes, but without success.

ADD REPLY • link 7.1 years ago by Rose • 0

0

If you show what you tried and what didn't work, people will be more eager to point out your mistake or help you further, because you demonstrate that you have put effort into this problem, too.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

It might be helpful to know why you want to modify your headers in this fashion and what some of your other headers look like.

ADD REPLY • link 7.1 years ago by Matt Shirley 10k

0

I would like to run BLAT with unique FASTA identifiers.

ADD REPLY • link 7.1 years ago by Rose • 0

0

This should be reasonably straightforward using Biopython's SeqIO.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Won't Biopython shorten the headers automatically, thus being unable to parse the header?

ADD REPLY • link 7.1 years ago by st.ph.n ★ 2.7k

0

I'm not aware of that. What makes you think it does?

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

1

Most methods that access FASTA entries using the offsets stored in a *.fai file will truncate the header name at the first whitespace. However, Bio.SeqIO does not use this scheme. Both samtools and pyfaidx do, but pyfaidx provides FastaRecord.long_name, which recovers the entire header from the FASTA file instead of using the truncated version stored in the index file.

ADD REPLY • link 7.1 years ago by Matt Shirley 10k

0

Thanks, Matt. I did a quick test using the OP's first example header, and it was truncated to >gi|51039021|ref|NC_006130.1|

ADD REPLY • link 7.1 years ago by st.ph.n ★ 2.7k

2

Apparently Biopython uses the strict definition (if FASTA has any) of the ID as everything before the first space. See: Can Biopython Properly Import Fasta Headers With Spaces In Them? To get the whole header, use SeqRecord.description, not SeqRecord.id.
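To make the ID/description distinction concrete, here is a minimal plain-Python sketch. It does not call Biopython. `split_header` is a hypothetical helper that only mimics the convention described above: the ID is the text before the first whitespace, and the description is the full header line.

```python
# Plain-Python illustration only; this is NOT Biopython's own parsing code.
# split_header is a hypothetical helper name introduced for this example.
def split_header(header_line):
    """Return (id, description) for a FASTA header line.

    id          -> everything before the first whitespace
    description -> the whole header line without the leading '>'
    """
    text = header_line.rstrip("\n").lstrip(">")
    parts = text.split(None, 1)  # split once, at the first run of whitespace
    record_id = parts[0] if parts else ""
    return record_id, text


header = (">gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 "
          "plasmid pDN571, complete sequence")
record_id, description = split_header(header)
print(record_id)    # the truncated form reported earlier in this thread
print(description)  # the full header, which still contains the plasmid name
```

The plasmid name is only present in the description, so any header rewrite has to start from that field rather than from the ID.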

ADD REPLY • link 7.1 years ago by Matt Shirley 10k

score 2 · Answer 1 · 2017-03-15

2

7.1 years ago

Alex Reynolds 35k

Assuming there is consistency in header formatting:

$ awk '{ if ($0 !~ /^>/) { print $0; } else { m = split($0, a, "|"); n = split(a[5], b, " "); sub(/,$/, "", b[5]); print a[1]"|"a[2]"|"b[5]; } }' in.fa > out.fa

Buyer beware: If anything changes in the header format, you may not get an expected result. No refunds!
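If silent failures on odd headers are a concern, a stricter variant is possible. The sketch below is one way to do it in plain Python, not a replacement for the awk one-liner. It assumes every header looks like `>gi|<number>|ref|<accession>| ... plasmid <name>, ...`, and it raises an error on anything else. `HEADER_RE`, `shorten_header` and `rewrite_fasta` are hypothetical names made up for this example.

```python
import re

# Assumed layout (taken from the three example headers in the question):
#   >gi|<number>|ref|<accession>| <organism and strain> plasmid <name>, <rest>
HEADER_RE = re.compile(r"^>gi\|(\d+)\|ref\|[^|]+\|.*\bplasmid\s+([^,\s]+),")


def shorten_header(line):
    """Return '>gi|<number>|<plasmid name>' or raise on an unexpected format."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unexpected header format: {line.rstrip()!r}")
    gi_number, plasmid = match.groups()
    return f">gi|{gi_number}|{plasmid}"


def rewrite_fasta(lines):
    """Yield lines with headers shortened and sequence lines unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield shorten_header(line) if line.startswith(">") else line


example = [
    ">gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 "
    "plasmid pDN571, complete sequence",
    "ACGT",  # placeholder sequence line, not real data
]
result = list(rewrite_fasta(example))
print("\n".join(result))
```

To run it on a real file, pass an open file handle to `rewrite_fasta` and write the yielded lines to a new file.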

ADD COMMENT • link 7.1 years ago by Alex Reynolds 35k

score 1 · Answer 2 · 2017-03-15

1

7.1 years ago

GenoMax 141k

Try this version as well:

$ awk -F '[ |,]' '{ if ($0 ~ /^>/) { print $1"|"$2"|"$10;} else { print $0;}}' input.fa > output.fa
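To see why field 10 holds the plasmid name, the split can be reproduced in Python as a rough illustration. `re.split` with the same character class behaves like awk's regex field separator here: every space, pipe, or comma starts a new field, so the `| ` pair after the accession produces an empty field 5.

```python
import re

header = (">gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 "
          "plasmid pCCK381, complete sequence")

# Same separator set as the awk command: space, pipe, or comma.
fields = re.split(r"[ |,]", header)

# Print the fields with awk-style 1-based numbering.
for number, field in enumerate(fields, start=1):
    print(f"${number} = {field!r}")

# awk's $1, $2 and $10 correspond to Python indices 0, 1 and 9.
short = "|".join([fields[0], fields[1], fields[9]])
print(short)
```

The position of the plasmid name depends on the organism part being exactly three words (genus, species, strain), as in the three example headers. A header with more or fewer words there would shift the field number.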

ADD COMMENT • link 7.1 years ago by GenoMax 141k