Tutorial: Add count numbers to headers in a FASTA file
7.4 years ago
From:

>a
ACTCTAAAT

>b
AAAAACCCT

etc.

To:

>a_1
ACTCTAAAT

>b_2
AAAAACCCT

awk '/^>/{$0=$0"_"(++i)}1' in > out
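For anyone new to awk, the one-liner can be spelled out roughly as below. This is only a commented sketch of the same logic; the file names demo.fa and demo_numbered.fa are made up for illustration.

```shell
# Build a tiny two-record FASTA file (hypothetical demo data).
printf '>a\nACTCTAAAT\n>b\nAAAAACCCT\n' > demo.fa

awk '
  /^>/ {             # header lines start with ">"
    i++              # running count of headers seen so far
    $0 = $0 "_" i    # append "_<count>" to the end of the whole header line
  }
  { print }          # print every line (this is the trailing "1" in the short form)
' demo.fa > demo_numbered.fa

cat demo_numbered.fa
```

Since the suffix goes on the end of the whole header line, a header that carries a description ends up with the number after the description rather than after the ID.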

Could you expand a bit on your post? What's the purpose of doing this?


Sorry, I just wanted to record this. Thanks for your concern; I will explain more next time.

ZQ

7.4 years ago

Another way to do it, which works with single-line FASTA input:

$ awk 'BEGIN{RS=">"}{if(NR>1)print ">"$1"_"(NR-1)"\n"$2}' input.fa > output.fa

A second way, which allows multiline FASTA input:

$ awk 'BEGIN{RS=">";OFS="\n"}(NR>1){print ">"$1"_"(NR-1)"\n";$1="";print $0}' input.fa | awk '$0' > output.fa

Hi - I know this post is quite old now, but I tried the above code for the multi-line FASTA and it produced a FASTA file that had only the names and then the number, but none of the sequences. I would like to add the numbers to the headers while keeping the sequences intact in the output file. Is there any way to do this? I have been trying to resolve a makeblastdb error saying I have duplicate seqids, and I was hoping this approach might resolve the error.

Edit: I tried the line from Alex Reynolds: awk '/^>/{$0=$0"_"(++i)}1' in > out
It successfully added a number at the end of the description line, but I am trying to add the number directly to the end of the sequence ID, since adding it to the description didn't seem to help with my makeblastdb error.

E.g., I want:

>CP064824.1_1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....

Instead of:

>CP064824.1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence_1
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....

Thanks!


Given input.fa:

>CP064824.1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC

Note the two Klebsiella pneumoniae entries.

The following will append an incremented counter to the sequence ID:

$ awk 'BEGIN{FS=" ";RS=">"}{if(NR>1){ a[$1]++; h=""; for(i=2; i<NF; i++) { h=h" "$i; } print ">"$1"_"a[$1]h"\n"$NF; }}' input.fa
>CP064824.1_1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6_1 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1_2 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC

For multiline FASTA, you'd need to make modifications, but hopefully this gives you some ideas.
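One possible modification for multiline FASTA, as a rough sketch rather than a recipe tested on every awk: read the file line by line instead of splitting on ">", and rewrite only the first word of each header. The file names multi.fa and multi_numbered.fa and the wrapped demo records are made up for illustration, and it assumes the ID follows ">" directly, with no space in between.

```shell
# Hypothetical multi-line input with a repeated sequence ID.
cat > multi.fa <<'EOF'
>CP064824.1 Klebsiella pneumoniae
TGAACGTCACTGCCATCC
TGCATTCTGAATGG
>ABC12345.6 Mycobacterium tuberculosis
ATTTCTCTCTGACATC
>CP064824.1 Klebsiella pneumoniae
GCATTCTGAATGGCAG
CATTTGAACG
EOF

awk '
  /^>/ {
    n = ++seen[$1]              # per-ID occurrence counter, keyed on the first word
    sub(/^>[^ \t]+/, "&_" n)    # "&" is the matched ID; the description is left alone
  }
  { print }                     # sequence lines pass through untouched
' multi.fa > multi_numbered.fa

# Sanity check: list any IDs that still occur more than once (ideally prints nothing).
awk '/^>/ { print $1 }' multi_numbered.fa | sort | uniq -d
```

Because sequence lines are never split into fields here, wrapped sequences and headers with long descriptions should both come through unchanged apart from the added suffix.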

7.4 years ago
venu 7.1k

You can do something like the following:

cat file.fa | paste - - | awk '{print $1"_"NR"\n"$2}' > new_file.fa
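
Note that paste - - pairs each header line with the single line after it, so this relies on every record being exactly two lines and on headers without spaces. If the sequences are wrapped, one workaround (only a sketch; wrapped.fa and unwrapped.fa are made-up names) might be to join the sequence lines first:

```shell
# Hypothetical input where the first sequence is wrapped over two lines.
printf '>a\nACTC\nTAAAT\n>b\nAAAAACCCT\n' > wrapped.fa

# Join wrapped sequence lines so that each record is exactly two lines.
awk '
  /^>/ { if (seq != "") print seq; seq = ""; print; next }   # flush previous sequence, print header
  { seq = seq $0 }                                           # accumulate sequence lines
  END { if (seq != "") print seq }                           # flush the last sequence
' wrapped.fa > unwrapped.fa

# Same idea as above: pair header and sequence, then number the header.
paste - - < unwrapped.fa | awk '{print $1"_"NR"\n"$2}' > new_file.fa
```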

I don't think it's a question; I think he is sharing a way of doing just this.


Oops! I think I was in too much of a hurry, as I am busy with our Biostars Handbook.


Best excuse I can imagine, keep up the good work.


If you put this in the Handbook, make sure to use the OP's awk, as it works with all FASTA files and not just ones where the sequence is less than 120 characters! :) (Or don't put awk in at all, because awk is the devil! :P)


Thanks for the new way to do this. ZQ
