Question

Adding Unique Identifiers To Fasta Headers

1

Entering edit mode

10.4 years ago

bioinfo ▴ 830

I have 5000 FASTA sequences with Uniprot ids. Now, I want to add a unique identifier at the beginning of each FASTA header. An example will explain you a bit better.

>sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------

>tr|B57384|ref|DRF_ERG ribosomal protein 
-------------------------------------------------------------

And so on

I want to add ABC0001 to ABC5000 at the beginning of the fasta header. And the corresponding gene name from my txt file.

gopA  ABC0001 A12345
gopD  ABC0002 B57384
........................
fotR   ABC5000  C12345

Output:

>ABC0001|gopA|sp|A12345|ref|COG_REF human ribosomal protein
    -------------------------------------------------------------

>ABC0002|gopD|tr|B57384|ref|DRF_ERG ribosomal protein 
    -------------------------------------------------------------

    And so on

As I understand, I have to match the uniprot IDs from the txt file to the FASTA sequence file and then grab the ABC ids ( e.g. ABC0001) and Gene name (gopA) and add them at the beginning of the FASTA header.

fasta perl awk • 5.1k views

ADD COMMENT • link updated 10.4 years ago by Biostar 20 • written 10.4 years ago by bioinfo ▴ 830

0

Entering edit mode

What have you tried?

ADD REPLY • link 10.4 years ago by Kenosis ★ 1.3k

0

Entering edit mode

I have modified the txt file in a better way

less uniprotid_ABCid_genename.txt| sed 's/\t/|/g'

output>

P87546|ABC0447|yohF
P12345|ABC0448|pogR

Now trying to match –f1 (P87546) and replacing the Uniprot id (P87546) with “P87546|ABC0447|yohF” in FASTA file below. Later I can change the order of the ids in the FASTA header to >ABC0447|yohF|tr|P87546.

 tr|P87546|ref|TRG_DEF ribosomal protein
   ..................................................................

I don’t know any one liner solution that can do it. Should I try in perl by getting each id and then loop through the FASTA sequences followed by matching ID to header. I just added non-unique identifier to each sequence but that’s not what I want. I’m really close, need a bit more time I guess.

less file.fasta | sed 's/^>/>ABC0001|gopT|/g'| less

>ABC0001|gopT|tr|B5LX01|B123451_ABCFF OS=Campylobacter jejuni GN=xxx PE=4 SV=1
...............................................

ADD REPLY • link 10.4 years ago by bioinfo ▴ 830

0

Entering edit mode

Should I try in perl by getting each id and then loop through the FASTA sequences followed by matching ID to header. Yes--you will need to load your gene names first, then apply a substitution to the header. If you don't mind, I'll post a solution candidate...

ADD REPLY • link 10.4 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Join could probably do this. General syntax:

    join -1 1 -2 1 -o 1.3,2.1 <(sort -k1,1 file1) <(sort -k2,2 file2) > output

Would join based on file1 column 1 and file2 column2, and output file1 column 3 file2 column1. You might first have to strip the fasta headers from the fasta file. After making new headers you can replace the old ones with the new ones, again with join.

ADD REPLY • link 10.4 years ago by 5heikki 11k

score 1 · Answer 1 · 2013-11-07

1

Entering edit mode

10.4 years ago

Kenosis ★ 1.3k

I think your comment about processing the gene names first, then the fasta file is on target. Given this, consider the following:

use strict;
use warnings;

my $fasta = pop;
my %hash;

while (<>) {
    my @x = split;
    $hash{ $x[2] } = "$x[1]|$x[0]";
}

push @ARGV, $fasta;

while (<>) {
    s/>(\w+)\|(\w+)/>$hash{$2}|$1|$2/;
    print;
}

Usage: perl script.pl geneNames fastaFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>ABC0001|gopA|sp|A12345|ref|COG_REF human ribosomal protein
-------------------------------------------------------------

>ABC0002|gopD|tr|B57384|ref|DRF_ERG ribosomal protein 
-------------------------------------------------------------

The script first (implicitly) pops the fasta file name off of @ARGV. It then processes the geneName lines, using the ID as a hash key and the other, combined items as the assocciated value. Next, the fasta file's processed, the ID's captured, and a substitution is made using the hash value.

Hope this helps!

Edit: Added |$2 to the substitution.

ADD COMMENT • link 10.4 years ago by Kenosis ★ 1.3k

0

Entering edit mode

my @x = split;
    $hash{ $x[2] } = "$x[1]|$x[0]";

here, you mean split by "|" of the GeneNames txt file (as P123456|gopA|ABC0001 single column format?) and then hash them as gopA = ABC0001|P123456. I will give a test run.

ADD REPLY • link 10.4 years ago by bioinfo ▴ 830

0

Entering edit mode

my @x = split; splits your gene name data on whitespace, and "$x[1]|$x[0]" creates a string (the value associated to an ID as the key) that uses the pipe character, like in the header. Thus, the line "gopA ABC0001 A12345" would become { A12345 => 'ABC0001|gopA' }.

In case you didn't notice my Edit remark at the end of the solution, I added |$2 to the substitution, as I had inadvertently left out part of the header.

ADD REPLY • link 10.4 years ago by Kenosis ★ 1.3k

0

Entering edit mode

perfect. I realised that the genenames txt file should be in this order.

corT    ABC0447    P12345

I just added the ID ($2) as well to get the exact output.

while (<>) {
    s/>(\w+)\|(\w+)/>$hash{$2}|$1|$2/;
    print;

I really appreciate for your help.

ADD REPLY • link 10.4 years ago by bioinfo ▴ 830

0

Entering edit mode

You're most welcome, C. Pal!

ADD REPLY • link 10.4 years ago by Kenosis ★ 1.3k