Append to each FASTA header the number of times its sequence is repeated
5.2 years ago
Gene-ticks ▴ 10

Hi all, I need some help programming this small problem in Perl or Python, neither of which I am very familiar with. Is there a way to append a count (like "Xtimes") to the header of the first occurrence of a duplicated sequence and discard the other duplicates?

>name_1_1
AGGGTTT
>name_:2:_X
GTTTGAA
>name_:3:_Y
GTTTGAA

Result I want:

>name_1_1
AGGGTTT
>name_:2:_X_2times
GTTTGAA
perl python fasta • 1.7k views

mirDeep (Perl code) does exactly this. It is quite simple: build a hash with sequences as keys and names as values. If a key already exists, increment a counter for it instead of assigning the new sequence name as the value.
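
The hash idea described above can be sketched in Python with plain dictionaries. This is only a minimal illustration, not mirDeep's actual code: the function names `read_fasta` and `collapse_duplicates` are made up for this example, and the `_Ntimes` suffix follows the format asked for in the question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_duplicates(lines):
    """Keep the first header seen for each sequence and count repeats."""
    first_name = {}  # sequence -> header of its first occurrence
    counts = {}      # sequence -> number of occurrences
    for name, seq in read_fasta(lines):
        if seq not in first_name:
            first_name[seq] = name
            counts[seq] = 0
        counts[seq] += 1

    out = []
    # Dictionaries keep insertion order (Python 3.7+), so records come
    # out in the order their sequences first appeared in the input.
    for seq, name in first_name.items():
        n = counts[seq]
        suffix = f"_{n}times" if n > 1 else ""
        out.append(f">{name}{suffix}")
        out.append(seq)
    return out


example = """>name_1_1
AGGGTTT
>name_:2:_X
GTTTGAA
>name_:3:_Y
GTTTGAA
""".splitlines()

print("\n".join(collapse_duplicates(example)))
```

To run it on a real file, pass an open file handle instead of `example`, e.g. `collapse_duplicates(open("input.fa"))`, where `input.fa` is a placeholder name.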

5.2 years ago

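The FASTA records are linearized (one record per line: header, tab, sequence). Two processes are then joined on the sequence: the first extracts the sequences and counts each one; the second sorts the linearized FASTA on the sequence, keeping one record per unique sequence.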

 join -t $'\t' -1 2 -2 2 <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa | cut -f 2 | sort | uniq -c | awk '{printf("%s\t%s\n",$1,$2);}' ) <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa | sort  -t $'\t' -u -k2,2) | awk -F '\t' '{printf("%s_%s\n%s\n",$3,$2,$1);}'
>name_1_1_1
AGGGTTT
>name_:2:_X_2
GTTTGAA
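
For readers who find the one-liner hard to follow, the same linearize, sort and count steps can be sketched in Python. This is a rough equivalent under my own assumptions, not a translation checked against every edge case: `linearize` and `dereplicate_sorted` are hypothetical names, and the count is appended to every header (including singletons) to mirror the output shown above.

```python
from itertools import groupby


def linearize(lines):
    """Return one (header, sequence) tuple per FASTA record."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([line, ""])
        elif line and records:
            records[-1][1] += line  # join wrapped sequence lines
    return [(header, seq) for header, seq in records]


def dereplicate_sorted(lines):
    """Sort records on the sequence, then count each run of duplicates."""
    # sorted() is stable, so within a run of identical sequences the
    # record that came first in the input stays first.
    records = sorted(linearize(lines), key=lambda rec: rec[1])
    out = []
    for seq, group in groupby(records, key=lambda rec: rec[1]):
        group = list(group)
        out.append(f"{group[0][0]}_{len(group)}")  # first header + count
        out.append(seq)
    return out


fasta_lines = [
    ">name_1_1", "AGGGTTT",
    ">name_:2:_X", "GTTTGAA",
    ">name_:3:_Y", "GTTTGAA",
]

print("\n".join(dereplicate_sorted(fasta_lines)))
```

As with the shell version, the records come out ordered by sequence rather than in their original file order.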
5.2 years ago
gb ★ 2.2k

I don't know whether you necessarily need to program it yourself, but I use vsearch --derep_fulllength for this; it is easy and fast.
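
If you want to call it from a script, a small wrapper could look like the sketch below. The file names are placeholders and the function names are invented for this example. It assumes vsearch is installed and on the PATH; `--output` names the dereplicated FASTA file, and `--sizeout` should append the abundance to each header as `;size=N` rather than the `_Ntimes` suffix asked for in the question.

```python
import shutil
import subprocess


def vsearch_derep_command(fasta_in, fasta_out):
    """Build the vsearch argument list for full-length dereplication."""
    return [
        "vsearch",
        "--derep_fulllength", fasta_in,  # input FASTA
        "--output", fasta_out,           # dereplicated FASTA
        "--sizeout",                     # add abundance to the headers
    ]


def run_vsearch_derep(fasta_in, fasta_out):
    """Run vsearch, failing early if the executable is not available."""
    if shutil.which("vsearch") is None:
        raise RuntimeError("vsearch not found on PATH")
    subprocess.run(vsearch_derep_command(fasta_in, fasta_out), check=True)


# Placeholder file names, shown only to illustrate the resulting command.
print(" ".join(vsearch_derep_command("input.fa", "derep.fa")))
```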
