Question

Tutorial: Add count numbers to headers in a FASTA file

1

7.4 years ago

wu.zhiqiang.1020 ▴ 50

>a
ACTCTAAAT

>b
AAAAACCCT

etc.

To

>a_1
ACTCTAAAT

>b_2
AAAAACCCT

awk '/^>/{$0=$0"_"(++i)}1'  in > out
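For readers new to awk, the one-liner above can be unpacked as in the sketch below. The function name add_header_counts is only an illustrative wrapper and is not part of the original post; the awk program inside it is meant to behave the same as the one-liner.

```shell
# add_header_counts: hypothetical helper name; reads FASTA on stdin, writes to stdout.
add_header_counts() {
  awk '
    /^>/ {                  # header lines start with ">"
      $0 = $0 "_" (++i)     # append "_" and a running count to the end of the line
    }
    { print }               # print every line; this is what the trailing "1" does
  '
}
```

It could be run as: add_header_counts < in > out. Because sequence lines pass through untouched, wrapped (multiline) sequences are handled as well.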

genome • 5.7k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 7.4 years ago by wu.zhiqiang.1020 ▴ 50

0

Could you expand a bit on your post? What's the purpose of doing this?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Sorry, I just wanted to record this. Thanks for your concern; I will explain more next time.

ZQ

ADD REPLY • link 7.4 years ago by wu.zhiqiang.1020 ▴ 50

score 3 · Answer 1 · 2016-12-13

3

7.4 years ago

Alex Reynolds 35k

Another way to do it, which works with single-line FASTA input:

$ awk 'BEGIN{RS=">"}{if(NR>1)print ">"$1"_"(NR-1)"\n"$2}' input.fa > output.fa

A second way, which allows multiline FASTA input:

$ awk 'BEGIN{RS=">";OFS="\n"}(NR>1){print ">"$1"_"(NR-1)"\n";$1="";print $0}' input.fa | awk '$0' > output.fa

ADD COMMENT • link 7.4 years ago by Alex Reynolds 35k

0

Hi - I know this post is quite old now, but I tried the above code for the multi-line fasta and it produced a fasta file that had only the names and then the number, but none of the sequences. I would like to add the numbers to the headers, but keeping the sequences intact in the output file - Is there any way to do this? I have been trying to resolve a makeblastdb error saying I have duplicate seqids and I was hoping this approach might resolve the error.

Edit: I tried the line from the original post: awk '/^>/{$0=$0"_"(++i)}1' in > out
It successfully added a number at the end of the description line, but I am trying to append the number directly to the sequence ID, since adding it to the description didn't seem to help with my makeblastdb error.

E.g., I want:

>CP064824.1_1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....

Instead of:

>CP064824.1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence_1
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....

Thanks!

ADD REPLY • link 3.4 years ago by aroebz • 0

0

Given input.fa:

>CP064824.1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC

Note the two Klebsiella pneumoniae entries.

The following will append a per-ID occurrence counter to the sequence ID (the first occurrence of each ID gets _1, the second _2, and so on), for single-line FASTA:

$ awk 'BEGIN{FS=" ";RS=">"}{if(NR>1){ a[$1]++; h=""; for(i=2; i<NF; i++) { h=h" "$i; } print ">"$1"_"a[$1]h"\n"$NF; }}' input.fa
>CP064824.1_1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6_1 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1_2 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC

For multiline FASTA, you'd need to make modifications, but hopefully this gives you some ideas.
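One possible modification for multiline FASTA, as a minimal sketch: only header lines are edited, and every other line is printed unchanged, so wrapped sequences stay intact. The function name number_seq_ids is hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token immediately after ">".

```shell
# number_seq_ids: hypothetical helper name; reads FASTA on stdin, writes to stdout.
number_seq_ids() {
  awk '
    /^>/ {
      n = ++seen[$1]              # occurrences so far of this ID (first field, including ">")
      sub(/^>[^ \t]+/, "&_" n)    # "&" stands for the matched ID; append "_" and the count
    }
    { print }                     # sequence lines pass through unchanged
  '
}
```

It could be run as: number_seq_ids < input.fa > output.fa. Using sub() rather than reassigning $1 leaves the rest of the header, including its original spacing, as it was.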

ADD REPLY • link 3.0 years ago by Alex Reynolds 35k

score 0 · Answer 2 · 2016-12-13

0

7.4 years ago

venu 7.1k

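You can do something like the following (it assumes every record is exactly two lines, a header and a single sequence line; -F'\t' keeps headers that contain spaces in one piece):

cat file.fa | paste - - | awk -F'\t' '{print $1"_"NR"\n"$2}' > new_file.fa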

ADD COMMENT • link 7.4 years ago by venu 7.1k

0

I don't think it's a question; I think he is sharing a way of doing just this.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

2

Oops! I think I was in too much of a hurry, as I am busy with our Biostars Handbook.

ADD REPLY • link 7.4 years ago by venu 7.1k

0

Best excuse I can imagine, keep up the good work.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

If you put this in the Handbook, make sure to use the OP's awk as it works with all FASTA files and not just ones where the sequence is less than 120 characters! :) (or don't put awk in at all, because awk is the devil! :P)