Biostar Beta. Not for public use.
Question: How to rename duplicate fasta headers by numbering the ID before the first underscore '_'?

Hi,

I have a fasta file input.fa which has duplicate fasta headers.

    >UID584_Org_name_strain
    JGTTQEWEILYNMCCDASSHHGTQEFGRKKQLAADBNGTAVQEGFN
    >JGH1236_Org_name
    HFHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGVDP
    >JGH1236_Org_name
    FHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGV
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSNNIGLTTREKKQLAADBN
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLV
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTY
    >GTK584_Org_name_str
    KKQLAADBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTRE
    >GTK584_Org_name_str
    DBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTREADBNGTAVQEGFNME
    >EAD5624_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQEGFNMENDLALFVTYGAAWLMAFDSCVIIPPT
    >EAD5624_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGF

I want to rename all the headers by appending a running number to the ID before the first underscore '_', like this:

    >UID584.1_Org_name_strain
    JGTTQEWEILYNMCCDASSHHGTQEFGRKKQLAADBNGTAVQEGFN
    >JGH1236.1_Org_name
    HFHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGVDP
    >JGH1236.2_Org_name
    FHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGV
    >KIL563.2.1_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSNNIGLTTREKKQLAADBN
    >KIL563.2.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLV
    >KIL563.2.3_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTY
    >GTK584.1_Org_name_str
    KKQLAADBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTRE
    >GTK584.2_Org_name_str
    DBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTREADBNGTAVQEGFNME
    >EAD5624.1_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQEGFNMENDLALFVTYGAAWLMAFDSCVIIPPT
    >EAD5624.2_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGF

It would be convenient if anybody could suggest a sed/awk/grep command for this. Any help would be appreciated. Thanks!

11 months ago MB • 20 • updated 11 months ago Carambakaracho ♦ 1.2k

This will fail if duplicate headers are not adjacent to each other, as they are in your example:

    awk 'BEGIN{OFS=FS="_"}{if(/^>/){CUR=$1;{if(CUR==PRE){NUM++}else{NUM=1}};$1="";print CUR"."NUM $0;PRE=CUR}else{print $0}}' in > out
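
Whether that adjacency assumption holds for a given file can be checked before renaming anything. A minimal sketch, assuming the ID is everything before the first underscore; `nonadjacent_ids` is a hypothetical helper name, not something from the command above:

```shell
# Hypothetical helper: reads FASTA on stdin and prints an ID (the header
# text before the first "_") each time it reappears after a different ID.
nonadjacent_ids() {
    awk -F_ '
        /^>/ {
            # Seen earlier, but not on the immediately preceding header.
            if ($1 != prev && ($1 in seen)) print $1
            seen[$1] = 1
            prev = $1
        }
    '
}
```

Running `nonadjacent_ids < input.fa` and getting no output suggests the duplicates are grouped together; any printed ID would be numbered from 1 again by a counter that only compares against the previous header.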
11 months ago 5heikki 8.4k

Thanks! worked perfectly!

11 months ago MB • 20

This works, too. The sequences do not need to be in any particular order. It's Perl, though...

    perl -ne 'if (/^>/){@a=split /_/; $h{$a[0]}++; $a[0].= ".".$h{$a[0]}; $s=join "_", @a; print $s;}else{print $_;}' <in >out
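
Since the question asked for awk, the same per-ID counter idea can be sketched in awk as well. This is only an illustration: `rename_fasta_headers` is a hypothetical function name, and it assumes the ID is everything before the first underscore (or the whole header line if there is no underscore).

```shell
# Hypothetical helper: reads FASTA on stdin, writes renamed FASTA to stdout.
rename_fasta_headers() {
    awk '
        /^>/ {
            # Position of the first underscore; 0 means the header has none.
            i = index($0, "_")
            id   = (i ? substr($0, 1, i - 1) : $0)
            rest = (i ? substr($0, i) : "")
            # One counter per ID, kept for the whole file, so repeated
            # headers do not have to be adjacent.
            print id "." (++count[id]) rest
            next
        }
        { print }   # sequence lines pass through unchanged
    '
}
```

It would be used as `rename_fasta_headers < input.fa > output.fa`. The counter array holds one entry per distinct ID, which should stay small next to the sequence data.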
11 months ago Carambakaracho ♦ 1.2k

Thanks, worked like a charm!

11 months ago MB • 20
