Question

Converting Uniprot File to a Fasta File in Perl

0

7.2 years ago

kingkajc1 • 0

Hi fam,

I have a UniProt file that I need to parse, pulling values from certain lines in order to build a FASTA file. Here is an example UniProt entry:

ID   ARF1_PLAFA              Reviewed;         181 AA.
AC   Q94650; O02502; O02593;
DT   15-JUL-1998, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 3.
DT   25-NOV-2008, entry version 52.
DE   RecName: Full=ADP-ribosylation factor 1;
GN   Name=ARF1; Synonyms=ARF, PLARF;
OS   Plasmodium falciparum.
OC   Eukaryota; Alveolata; Apicomplexa; Aconoidasida; Haemosporida;
OC   Plasmodium; Plasmodium (Laverania).
OX   NCBI_TaxID=5833;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=T9/96; TISSUE=Blood;
RX   MEDLINE=97112480; PubMed=8954160;
RX   DOI=10.1111/j.1432-1033.1996.0104r.x;
RA   Stafford W.H., Stockley R.W., Ludbrook S.B., Holder A.A.;
RT   "Isolation, expression and characterization of the gene for an ADP-
RT   ribosylation factor from the human malaria parasite, Plasmodium
RT   falciparum.";
RL   Eur. J. Biochem. 242:104-113(1996).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RX   MEDLINE=97237566; PubMed=9084044; DOI=10.1016/S0166-6851(96)02803-4;
RA   Truong R.M., Francis S.E., Chakrabarti D., Goldberg D.E.;
RT   "Cloning and characterization of Plasmodium falciparum ADP-
RT   ribosylation factor and factor-like genes.";
RL   Mol. Biochem. Parasitol. 84:247-253(1997).
CC   -!- FUNCTION: GTP-binding protein that functions as an allosteric
CC       activator of the cholera toxin catalytic subunit, an ADP-
CC       ribosyltransferase. Involved in protein trafficking; may modulate
CC       vesicle budding and uncoating within the Golgi apparatus (By
CC       similarity).
CC   -!- SUBCELLULAR LOCATION: Golgi apparatus (By similarity).
CC   -!- SIMILARITY: Belongs to the small GTPase superfamily. Arf family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; Z80359; CAB02498.1; -; Genomic_DNA.
DR   EMBL; U57370; AAB63304.1; -; mRNA.
DR   HSSP; P32889; 1RRF.
DR   SMR; Q94650; 6-179.
DR   GO; GO:0005794; C:Golgi apparatus; IEA:UniProtKB-KW.
DR   GO; GO:0005525; F:GTP binding; IEA:InterPro.
DR   GO; GO:0015031; P:protein transport; IEA:UniProtKB-KW.
DR   GO; GO:0007264; P:small GTPase mediated signal transduction; IEA:InterPro.
DR   GO; GO:0016192; P:vesicle-mediated transport; IEA:UniProtKB-KW.
DR   InterPro; IPR006688; ARF.
DR   InterPro; IPR006689; ARF/SAR.
DR   InterPro; IPR001806; Ras_trnsfrmng.
DR   InterPro; IPR005225; Small_GTP_bd.
DR   PANTHER; PTHR11711; ARF/SAR; 1.
DR   Pfam; PF00025; Arf; 1.
DR   PRINTS; PR00449; RASTRNSFRMNG.
DR   PRINTS; PR00328; SAR1GTPBP.
DR   SMART; SM00177; ARF; 1.
DR   TIGRFAMs; TIGR00231; small_GTP; 1.
DR   PROSITE; PS01019; ARF; 1.
PE   2: Evidence at transcript level;
KW   ER-Golgi transport; Golgi apparatus; GTP-binding; Lipoprotein;
KW   Myristate; Nucleotide-binding; Protein transport; Transport.
FT   INIT_MET      1      1       Removed (Potential).
FT   CHAIN         2    181       ADP-ribosylation factor 1.
FT                                /FTId=PRO_0000207447.
FT   NP_BIND      24     31       GTP (By similarity).
FT   NP_BIND      67     71       GTP (By similarity).
FT   NP_BIND     126    129       GTP (By similarity).
FT   LIPID         2      2       N-myristoyl glycine (Potential).
SQ   SEQUENCE   181 AA;  20912 MW;  18013B069BEA2413 CRC64;
 MGLYVSRLFN RLFQKKDVRI LMVGLDAAGK TTILYKVKLG EVVTTIPTIG FNVETVEFRN
 ISFTVWDVGG QDKIRPLWRH YYSNTDGLIF VVDSNDRERI DDAREELHRM INEEELKDAI
 ILVFANKQDL PNAMSAAEVT EKLHLNTIRE RNWFIQSTCA TRGDGLYEGF DWLTTHLNNA
 K

I need to use regex to extract the values of the "AC", "OS", "OX", "ID", "GN" and "SQ" lines and build a FASTA record that looks like the example below. The first line of the FASTA record (the header) consists of the values parsed from those UniProt lines, separated by "|".

>NM_012514 | Rattus norvegicus | breast cancer 1 (Brca1) | mRNA
 CGCTGGTGCAACTCGAAGACCTATCTCCTTCCCGGGGGGGCTTCTCCGGCATTTAGGCCT
 CGGCGTTTGGAAGTACGGAGGTTTTTCTCGGAAGAAAGTTCACTGGAAGTGGAAGAAATG
 GATTTATCTGCTGTTCGAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAAAATC
 TTGGAGTGTCCAATCTGTTTGGAACTGATCAAAGAACCGGTTTCCACACAGTGCGACCAC
 ATATTTTGCAAATTTTGTATGCTGAAACTCCTTAACCAGAAGAAAGGACCTTCCCAGTGT
 CCTTTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAAGGAAGTGCAAGG

Some code I have so far:

#!/usr/bin/perl
use warnings;
use strict;

unless (open(UNIPROT, "<", "uniprotfile")) {
 die "Unable to open uniprot file", $!;
}

while (<UNIPROT>) {
 my $lines = $_;
 if ($lines =~ /^AC(.*)|^OS(.*)|^OX(.*)|^ID(.*)|^GN(.*)|^SQ(.*)/) {
    print "> headline ", " | " $1, " | ", $2, " | ", $3, " | ", $4, " | ", $5 " | ", "\n";
    print $6, "\n";
  }

What am I missing?? Any help would be great! Thanks!

uniprot fasta perl regex • 5.0k views

ADD COMMENT • link 7.2 years ago by kingkajc1 • 0

4

When asking for a specific solution (e.g. in Perl) you should clarify if that requirement is because this is an assignment you are working on. In that case, people expect to see effort on your part to produce some code (which may be not working and that is ok) before they will step in and assist.

ADD REPLY • link 7.2 years ago by GenoMax 141k

2

Use BioPerl so each feature can be accessed without having to parse plain text.

ADD REPLY • link 7.2 years ago by Ram 43k

score 3 · Answer 1 · 2017-02-22

3

7.2 years ago

ALchEmiXt ★ 1.9k

Without writing out the full answer from this tablet: BioPerl might be the best way to go, as already commented.

However, if you do want to do it line by line and performance is not an issue, approach it as follows:

1) Read the file line by line, as you already do.

2) Check each line for one line code (AC, OS, ...) at a time. When one matches, either store the whole line in a variable to process later, or extract the required data immediately with a second regexp and keep that in a variable.

3) Do this for every line code you need, on every line.

4) Continue until you reach the SQ line; from there you have to grab multiple lines to get the entire sequence.

5) Write out the FASTA format using the stored values.
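
The steps above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the sub name `uniprot_to_fasta`, the header order (AC | OS | OX | ID | GN), the `NA` placeholder for missing fields and the 60-character line width are all assumptions of mine, and only the first AC, OS and GN value of each entry is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn the lines of ONE UniProt entry into a FASTA record.
# Header layout (AC | OS | OX | ID | GN) is an assumption, not an official format.
sub uniprot_to_fasta {
    my @lines = @_;
    my %field;            # first value seen for each line code
    my $seq    = '';
    my $in_seq = 0;       # true once the SQ line has been seen

    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                 # strip the line ending
        if ($line =~ m{^//}) {                # end-of-entry marker
            last;
        }
        elsif ($in_seq) {                     # sequence lines follow SQ
            (my $chunk = $line) =~ s/\s+//g;  # remove the blanks between blocks
            $seq .= $chunk;
        }
        elsif ($line =~ /^SQ\s/) {
            $in_seq = 1;
        }
        elsif ($line =~ /^ID\s+(\S+)/) {
            $field{ID} //= $1;
        }
        elsif ($line =~ /^AC\s+([^;]+);/) {
            $field{AC} //= $1;                # primary accession only
        }
        elsif ($line =~ /^OS\s+(.*?)\.?\s*$/) {
            $field{OS} //= $1;                # species, trailing dot removed
        }
        elsif ($line =~ /^OX\s+NCBI_TaxID=(\d+)/) {
            $field{OX} //= $1;
        }
        elsif ($line =~ /^GN\s+Name=([^;]+);/) {
            $field{GN} //= $1;
        }
    }

    my $header = join ' | ', map { $field{$_} // 'NA' } qw(AC OS OX ID GN);
    $seq =~ s/(.{1,60})/$1\n/g;               # wrap the sequence at 60 residues
    return ">$header\n$seq";
}

# When a file name is given, read it one entry at a time ("//" line ends an entry).
if (@ARGV) {
    local $/ = "\n//\n";
    while (my $record = <>) {
        print uniprot_to_fasta(split /\n/, $record);
    }
}
```

Saved as, say, `up2fasta.pl` (a made-up name), it would be run as `perl up2fasta.pl uniprotfile > out.fasta`. Each line is tested against one line code at a time, so there is never more than one capture group to keep track of, and the sequence is collected in a separate branch once the SQ line has been seen.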

ADD COMMENT • link 7.2 years ago by ALchEmiXt ★ 1.9k

0

isn't that what I'm doing in my code? I'm not sure why it isn't working..

ADD REPLY • link 7.2 years ago by kingkajc1 • 0

1

No, actually your regexp tries to catch everything at once. That is not possible for the multi-line data after the SQ line. There you need to keep grabbing lines until you hit either an end delimiter such as // or EOF.
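
A small sketch of why one big alternation cannot fill all the capture variables together: each line is matched on its own, only one branch of the alternation can match it, so only that branch's capture group is set and the rest stay undefined (which Perl reports as "uninitialized" under `use warnings`). The sub name `defined_groups` and the shortened three-branch pattern are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the numbers of the capture groups that were
# actually set when one line is matched against the alternation.
sub defined_groups {
    my ($line) = @_;
    # In list context a successful match returns ($1, $2, $3), with undef
    # for every group whose branch did not take part in the match.
    my @caps = $line =~ /^AC\s+(.*)|^OS\s+(.*)|^OX\s+(.*)/;
    return grep { defined $caps[$_ - 1] } 1 .. scalar @caps;
}

print join(',', defined_groups("AC   Q94650; O02502; O02593;")), "\n";  # prints 1
print join(',', defined_groups("OS   Plasmodium falciparum.")), "\n";    # prints 2
print join(',', defined_groups("OX   NCBI_TaxID=5833;")), "\n";          # prints 3
```

So the values have to be stored as they are found, line by line, and printed only once the whole entry has been read.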

ADD REPLY • link 7.2 years ago by ALchEmiXt ★ 1.9k

0

PS: I assume the FASTA you provided is just an example, right? It is a nucleotide sequence, not a protein sequence like the Swiss-Prot entry. If you need nucleotide sequences, you have to follow the accessions to a different database type.

ADD REPLY • link 7.2 years ago by ALchEmiXt ★ 1.9k

0

Yeah, it was just an example. So this is what I have so far:

   if ($lines =~ /^AC\s+(.*)\;|^OS\s+(.*)|^OX\s+(.*)|^ID\s+(.*)|^GN\s+(.*)/) {
    print  $1, $2, $3, $4, $5, "\n";

but my output results in this:

Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT> line 1.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT> line 1.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT> line 1.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT> line 1.
CERU_HUMAN     STANDARD;      PRT;  1065 AA.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT> line 2.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT> line 2.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT> line 2.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT> line 2.
P00450; Q14063
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT> line 7.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT> line 7.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT> line 7.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT> line 7.
CP.
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT> line 8.
Use of uninitialized value $3 in print at ./file.pl line 11, <UNIPROT> line 8.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT> line 8.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT> line 8.
Homo sapiens (Human).
Use of uninitialized value $1 in print at ./file.pl line 11, <UNIPROT> line 11.
Use of uninitialized value $2 in print at ./file.pl line 11, <UNIPROT> line 11.
Use of uninitialized value $4 in print at ./file.pl line 11, <UNIPROT> line 11.
Use of uninitialized value $5 in print at ./file.pl line 11, <UNIPROT> line 11.
NCBI_TaxID=9606;

I'm not exactly sure where I'm making the mistake. Does the "or" (|) part mess up the loop?

ADD REPLY • link 7.2 years ago by kingkajc1 • 0

score 3 · Answer 2 · 2017-02-23

To build the "official" UniProtKB FASTA headers (http://www.uniprot.org/help/fasta-headers) you can also use the following code along with the Swissknife Perl module (http://swissknife.sourceforge.net/docs/):

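# Purpose:
# Read a file in Swiss-Prot (SP) format, write it in FASTA format.
#
# Usage:
# sp_to_fasta SP_file > FASTA_file

use strict;
use warnings;

use IO::File;

use SWISS::Entry;

# Take the input file name from the first command-line argument.
my $inputfile = $ARGV[0];
defined $inputfile
    or die "Usage: sp_to_fasta SP_file > FASTA_file\n";
my $fh = IO::File->new($inputfile, 'r')
    or die "Cannot open input file $inputfile: $!";

# Read one whole entry per iteration: entries end with a "//" line.
local $/ = "\n//";
while (<$fh>) {
    s/\r//g;                             # strip carriage returns
    (my $entry_txt = $_) =~ s/^\s+//;    # drop the newline left over from the previous entry
    next unless $entry_txt;              # skip the empty chunk after the last entry
    $entry_txt .= "\n";
    my $entry = SWISS::Entry->fromText($entry_txt);
    print $entry->toFasta();
}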