Extracting The Features From Genbank File
10.7 years ago

Hi, I have a set of mouse genes for which I would like to get the function, tissue specificity and other features. I obtained a file (.dat) from UniProt that contains all of this information, but it also contains a lot of information I don't need. The data seems to be in GenBank format. How can I extract only the features I'm interested in? When I searched, I found this site, http://www.molbiol-tools.ca/Convert.htm, and tried the gbk2ffn tool, but it gives an error. I'm not sure whether there is another way to do this. This is what my file looks like:

ID   1433B_MOUSE             Reviewed;         246 AA.
AC   Q9CQV8; O70455; Q3TY33; Q3UAN6;
DT   26-SEP-2001, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 3.
DT   24-JUL-2013, entry version 118.
DE   RecName: Full=14-3-3 protein beta/alpha;
DE   AltName: Full=Protein kinase C inhibitor protein 1;
DE            Short=KCIP-1;
DE   Contains:
DE     RecName: Full=14-3-3 protein beta/alpha, N-terminally processed;
GN   Name=Ywhab;
OS   Mus musculus (Mouse).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi;
OC   Muroidea; Muridae; Murinae; Mus; Mus.
OX   NCBI_TaxID=10090;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA].
RC   STRAIN=C57BL/6J;
RA   Karpitskiy V.V., Shaw A.S.;
RL   Submitted (APR-1998) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   STRAIN=C57BL/6J;
RC   TISSUE=Bone marrow, Embryo, Kidney, Liver, Thymus, and Visual cortex;
RX   PubMed=16141072; DOI=10.1126/science.1112014;
RA   Carninci P., Kasukawa T., Katayama S., Gough J., Frith M.C., Maeda N.,
RA   Oyama R., Ravasi T., Lenhard B., Wells C., Kodzius R., Shimokawa K.,
RA   Bajic V.B., Brenner S.E., Batalov S., Forrest A.R., Zavolan M.,
RA   Davis M.J., Wilming L.G., Aidinis V., Allen J.E.,
RA   Ambesi-Impiombato A., Apweiler R., Aturaliya R.N., Bailey T.L.,
RA   Bansal M., Baxter L., Beisel K.W., Bersano T., Bono H., Chalk A.M.,
RA   Chiu K.P., Choudhary V., Christoffels A., Clutterbuck D.R.,
RA   Crowe M.L., Dalla E., Dalrymple B.P., de Bono B., Della Gatta G.,
RA   di Bernardo D., Down T., Engstrom P., Fagiolini M., Faulkner G.,
RA   Fletcher C.F., Fukushima T., Furuno M., Futaki S., Gariboldi M.,
RA   Georgii-Hemming P., Gingeras T.R., Gojobori T., Green R.E.,
RA   Gustincich S., Harbers M., Hayashi Y., Hensch T.K., Hirokawa N.,
RA   Hill D., Huminiecki L., Iacono M., Ikeo K., Iwama A., Ishikawa T.,
RA   Jakt M., Kanapin A., Katoh M., Kawasawa Y., Kelso J., Kitamura H.,
RA   Kitano H., Kollias G., Krishnan S.P., Kruger A., Kummerfeld S.K.,
RA   Kurochkin I.V., Lareau L.F., Lazarevic D., Lipovich L., Liu J.,
RA   Liuni S., McWilliam S., Madan Babu M., Madera M., Marchionni L.,
RA   Matsuda H., Matsuzawa S., Miki H., Mignone F., Miyake S., Morris K.,
RA   Mottagui-Tabar S., Mulder N., Nakano N., Nakauchi H., Ng P.,
RA   Nilsson R., Nishiguchi S., Nishikawa S., Nori F., Ohara O.,
RA   Okazaki Y., Orlando V., Pang K.C., Pavan W.J., Pavesi G., Pesole G.,
RA   Petrovsky N., Piazza S., Reed J., Reid J.F., Ring B.Z., Ringwald M.,
RA   Rost B., Ruan Y., Salzberg S.L., Sandelin A., Schneider C.,
RA   Schoenbach C., Sekiguchi K., Semple C.A., Seno S., Sessa L., Sheng Y.,
RA   Shibata Y., Shimada H., Shimada K., Silva D., Sinclair B.,
RA   Sperling S., Stupka E., Sugiura K., Sultana R., Takenaka Y., Taki K.,
RA   Tammoja K., Tan S.L., Tang S., Taylor M.S., Tegner J., Teichmann S.A.,
RA   Ueda H.R., van Nimwegen E., Verardo R., Wei C.L., Yagi K.,
RA   Yamanishi H., Zabarovsky E., Zhu S., Zimmer A., Hide W., Bult C.,
RA   Grimmond S.M., Teasdale R.D., Liu E.T., Brusic V., Quackenbush J.,
RA   Wahlestedt C., Mattick J.S., Hume D.A., Kai C., Sasaki D., Tomaru Y.,
RA   Fukuda S., Kanamori-Katayama M., Suzuki M., Aoki J., Arakawa T.,
RA   Iida J., Imamura K., Itoh M., Kato T., Kawaji H., Kawagashira N.,
RA   Kawashima T., Kojima M., Kondo S., Konno H., Nakano K., Ninomiya N.,
RA   Nishio T., Okada M., Plessy C., Shibata K., Shiraki T., Suzuki S.,
RA   Tagami M., Waki K., Watahiki A., Okamura-Oho Y., Suzuki H., Kawai J.,
RA   Hayashizaki Y.;
RT   "The transcriptional landscape of the mammalian genome.";
RL   Science 309:1559-1563(2005).
RN   [3]
RP   PROTEIN SEQUENCE OF 1-12; 14-57; 61-70; 84-117; 128-169; 196-246 AND
RP   215-224, AND MASS SPECTROMETRY.
RC   STRAIN=C57BL/6, and OF1; TISSUE=Brain, and Hippocampus;
RA   Lubec G., Kang S.U., Sunyer B., Chen W.-Q.;
RL   Submitted (JAN-2009) to UniProtKB.
RN   [4]
RP   PHOSPHORYLATION AT SER-60.
RX   PubMed=9705322; DOI=10.1074/jbc.273.34.21834;
RA   Megidish T., Cooper J., Zhang L., Fu H., Hakomori S.;
RT   "A novel sphingosine-dependent protein kinase (SDK1) specifically
RT   phosphorylates certain isoforms of 14-3-3 protein.";
RL   J. Biol. Chem. 273:21834-21845(1998).
RN   [5]
RP   NITRATION [LARGE SCALE ANALYSIS] AT TYR-84 AND TYR-106, AND MASS
RP   SPECTROMETRY.
RC   TISSUE=Brain;
RX   PubMed=16800626; DOI=10.1021/bi060474w;
RA   Sacksteder C.A., Qian W.-J., Knyushko T.V., Wang H., Chin M.H.,
RA   Lacan G., Melega W.P., Camp D.G. II, Smith R.D., Smith D.J.,
RA   Squier T.C., Bigelow D.J.;
RT   "Endogenously nitrated proteins in mouse brain: links to
RT   neurodegenerative disease.";
RL   Biochemistry 45:8009-8022(2006).
RN   [6]
RP   INTERACTION WITH PRKCE.
RX   PubMed=18604201; DOI=10.1038/ncb1749;
RA   Saurin A.T., Durgan J., Cameron A.J., Faisal A., Marber M.S.,
RA   Parker P.J.;
RT   "The regulated assembly of a PKCepsilon complex controls the
RT   completion of cytokinesis.";
RL   Nat. Cell Biol. 10:891-901(2008).
RN   [7]
RP   INTERACTION WITH SAMSN1.
RX   PubMed=20478393; DOI=10.1016/j.biocel.2010.05.004;
RA   Brandt S., Ellwanger K., Beuter-Gunia C., Schuster M., Hausser A.,
RA   Schmitz I., Beer-Hammer S.;
RT   "SLy2 targets the nuclear SAP30/HDAC1 complex.";
RL   Int. J. Biochem. Cell Biol. 42:1472-1481(2010).
CC   -!- FUNCTION: Adapter protein implicated in the regulation of a large
CC       spectrum of both general and specialized signaling pathways. Binds
CC       to a large number of partners, usually by recognition of a
CC       phosphoserine or phosphothreonine motif. Binding generally results
CC       in the modulation of the activity of the binding partner. Negative
CC       regulator of osteogenesis. Blocks the nuclear translocation of the
CC       phosphorylated form (by AKT1) of SRPK2 and antagonizes its
CC       stimulatory effect on cyclin D1 expression resulting in blockage
CC       of neuronal apoptosis elicited by SRPK2 (By similarity).
CC   -!- SUBUNIT: Homodimer, and heterodimer with YWHAG, YWHAE and YWHAQ.
CC       Interacts with SSH1 and TORC2/CRTC2. Interacts with GAB2 and YAP1
CC       (phosphorylated form) (By similarity). Interacts with SAMSN1.
CC       Interacts with PKA-phosphorylated AANAT (By similarity). Interacts
CC       with the phosphorylated (by AKT1) form of SRPK2 (By similarity).
CC       Interacts with PRKCE (phosphorylated form).
CC   -!- INTERACTION:
CC       Q5S006:Lrrk2; NbExp=3; IntAct=EBI-771608, EBI-2693710;
CC   -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity). Melanosome (By
CC       similarity).
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation; Named isoforms=2;
CC       Name=Long;
CC         IsoId=Q9CQV8-1; Sequence=Displayed;
CC       Name=Short;
CC         IsoId=Q9CQV8-2; Sequence=VSP_018634;
CC         Note=No experimental confirmation available. Contains a
CC         N-acetylmethionine at position 1 (By similarity);
CC   -!- PTM: Isoform alpha differs from isoform beta in being
CC       phosphorylated (By similarity). Phosphorylated on Ser-60 by
CC       protein kinase C delta type catalytic subunit in a sphingosine-
CC       dependent fashion.
CC   -!- PTM: Isoform Short contains a N-acetylmethionine at position 1 (By
CC       similarity).
CC   -!- SIMILARITY: Belongs to the 14-3-3 family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; AF058797; AAC14343.1; -; mRNA.
DR   EMBL; AK002632; BAB22246.1; -; mRNA.
DR   EMBL; AK004872; BAB23631.1; -; mRNA.
DR   EMBL; AK011389; BAB27587.1; -; mRNA.
DR   EMBL; AK083367; BAC38886.1; -; mRNA.
DR   EMBL; AK144061; BAE25678.1; -; mRNA.
DR   EMBL; AK150414; BAE29538.1; -; mRNA.
DR   EMBL; AK151294; BAE30278.1; -; mRNA.
DR   EMBL; AK158932; BAE34730.1; -; mRNA.
DR   IPI; IPI00230682; -.
DR   IPI; IPI00760000; -.
DR   RefSeq; NP_061223.2; NM_018753.6.
DR   UniGene; Mm.34319; -.
DR   PDB; 4GNT; X-ray; 2.41 A; A=1-239.
DR   PDBsum; 4GNT; -.
DR   ProteinModelPortal; Q9CQV8; -.
DR   SMR; Q9CQV8; 2-232.
DR   IntAct; Q9CQV8; 612.
DR   MINT; MINT-1869492; -.
DR   PhosphoSite; Q9CQV8; -.
DR   UCD-2DPAGE; Q9CQV8; -.
DR   PaxDb; Q9CQV8; -.
DR   PRIDE; Q9CQV8; -.
DR   Ensembl; ENSMUST00000018470; ENSMUSP00000018470; ENSMUSG00000018326.
DR   GeneID; 54401; -.
DR   KEGG; mmu:54401; -.
DR   UCSC; uc008ntp.1; mouse.
DR   CTD; 7529; -.
DR   MGI; MGI:1891917; Ywhab.
DR   eggNOG; COG5040; -.
DR   GeneTree; ENSGT00710000106445; -.
DR   HOGENOM; HOG000240379; -.
DR   HOVERGEN; HBG050423; -.
DR   InParanoid; Q9CQV8; -.
DR   KO; K16197; -.
DR   OMA; CNDVLXT; -.
DR   OrthoDB; EOG4N30PR; -.
DR   Reactome; REACT_147847; Translocation of Glut4 to the Plasma Membrane.
DR   ChiTaRS; YWHAB; mouse.
DR   NextBio; 311260; -.
DR   ArrayExpress; Q9CQV8; -.
DR   Bgee; Q9CQV8; -.
DR   Genevestigator; Q9CQV8; -.
DR   GO; GO:0030659; C:cytoplasmic vesicle membrane; TAS:Reactome.
DR   GO; GO:0005829; C:cytosol; TAS:Reactome.
DR   GO; GO:0042470; C:melanosome; IEA:UniProtKB-SubCell.
DR   GO; GO:0048471; C:perinuclear region of cytoplasm; IEA:Compara.
DR   GO; GO:0017053; C:transcriptional repressor complex; IEA:Compara.
DR   GO; GO:0019904; F:protein domain specific binding; IDA:MGI.
DR   GO; GO:0003714; F:transcription corepressor activity; IEA:Compara.
DR   GO; GO:0051220; P:cytoplasmic sequestering of protein; IEA:Compara.
DR   GO; GO:0035308; P:negative regulation of protein dephosphorylation; IEA:Compara.
DR   GO; GO:0045892; P:negative regulation of transcription, DNA-dependent; IEA:Compara.
DR   GO; GO:0043085; P:positive regulation of catalytic activity; IEA:Compara.
DR   GO; GO:0051291; P:protein heterooligomerization; IEA:Compara.
DR   GO; GO:0006605; P:protein targeting; IDA:MGI.
DR   Gene3D; 1.20.190.20; -; 1.
DR   InterPro; IPR000308; 14-3-3.
DR   InterPro; IPR023409; 14-3-3_CS.
DR   InterPro; IPR023410; 14-3-3_domain.
DR   PANTHER; PTHR18860; PTHR18860; 1.
DR   Pfam; PF00244; 14-3-3; 1.
DR   PIRSF; PIRSF000868; 14-3-3; 1.
DR   PRINTS; PR00305; 1433ZETA.
DR   SMART; SM00101; 14_3_3; 1.
DR   SUPFAM; SSF48445; 14-3-3; 1.
DR   PROSITE; PS00796; 1433_1; 1.
DR   PROSITE; PS00797; 1433_2; 1.
PE   1: Evidence at protein level;
KW   3D-structure; Acetylation; Alternative initiation; Complete proteome;
KW   Cytoplasm; Direct protein sequencing; Nitration; Phosphoprotein;
KW   Reference proteome.
FT   CHAIN         1    246       14-3-3 protein beta/alpha.
FT                                /FTId=PRO_0000367902.
FT   INIT_MET      1      1       Removed; alternate (By similarity).
FT   CHAIN         2    246       14-3-3 protein beta/alpha, N-terminally
FT                                processed.
FT                                /FTId=PRO_0000000005.
FT   SITE         58     58       Interaction with phosphoserine on
FT                                interacting protein (By similarity).
FT   SITE        129    129       Interaction with phosphoserine on
FT                                interacting protein (By similarity).
FT   MOD_RES       1      1       N-acetylmethionine (By similarity).
FT   MOD_RES       2      2       N-acetylthreonine; in 14-3-3 protein
FT                                beta/alpha, N-terminally processed (By
FT                                similarity).
FT   MOD_RES      60     60       Phosphoserine.
FT   MOD_RES      70     70       N6-acetyllysine (By similarity).
FT   MOD_RES      84     84       Nitrated tyrosine.
FT   MOD_RES     106    106       Nitrated tyrosine.
FT   MOD_RES     117    117       N6-acetyllysine (By similarity).
FT   MOD_RES     186    186       Phosphoserine (By similarity).
FT   VAR_SEQ       1      2       Missing (in isoform Short).
FT                                /FTId=VSP_018634.
FT   CONFLICT     10     10       Q -> H (in Ref. 1; AAC14343).
FT   CONFLICT     74     74       N -> D (in Ref. 1; AAC14343).
FT   CONFLICT    126    126       D -> Y (in Ref. 2; BAE29538/BAE30278).
FT   HELIX         5     17
FT   HELIX        21     33
FT   HELIX        40     68
FT   HELIX        75    105
FT   HELIX       107    110
FT   HELIX       114    134
FT   HELIX       137    161
FT   HELIX       167    182
FT   HELIX       187    203
FT   HELIX       205    207
FT   TURN        210    212
FT   HELIX       213    230
SQ   SEQUENCE   246 AA;  28086 MW;  51C366ED85B38EED CRC64;
     MTMDKSELVQ KAKLAEQAER YDDMAAAMKA VTEQGHELSN EERNLLSVAY KNVVGARRSS
     WRVISSIEQK TERNEKKQQM GKEYREKIEA ELQDICNDVL ELLDKYLILN ATQAESKVFY
     LKMKGDYFRY LSEVASGENK QTTVSNSQQA YQEAFEISKK EMQPTHPIRL GLALNFSVFY
     YEILNSPEKA CSLAKTAFDE AIAELDTLNE ESYKDSTLIM QLLRDNLTLW TSENQGDEGD
     AGEGEN
//

There are many such entries in a single file. I want to get only specific features for a set of genes separately.

genbank feature extraction • 5.4k views

Which specific features are you looking for?


It's not a GenBank file; it's from UniProt.


I did not pay much attention to the file. Nice catch.

10.7 years ago
Neilfws 49k

This is not a GenBank file. You obtained it from UniProt so it's a UniProt (often called SwissProt) file.

Extraction of specific features from a larger file is called parsing. There are software libraries available in all of the major programming languages to parse this kind of file. See, for example, links at this BioPerl page. However, it sounds like you are not currently a programmer.

Another option is to use command line tools (grep, awk, cut, paste and so on). For example to get only the lines with tissue information, you might try something like:

grep -iP "^RC\s+TISSUE=" myfile.dat

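which says "print lines that begin with RC, followed by one or more whitespace characters, followed by TISSUE=". The -P option (GNU grep) enables Perl-style regular expressions so that \s is understood, and -i makes the match case-insensitive. Note that this pattern only matches lines where TISSUE= comes directly after RC; a line such as "RC   STRAIN=C57BL/6, and OF1; TISSUE=Brain, and Hippocampus;" in the example entry would be missed. A looser pattern such as "^RC\s.*TISSUE=" catches those lines as well.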

There are plenty of tutorials on the Web for these tools.
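For readers who do want to script the parsing step, here is a minimal sketch in plain Python (standard library only) rather than one of the dedicated parsers mentioned above. It assumes the usual flat-file layout seen in the question: a two-letter line code, the content starting at column 6, comment topics introduced by "-!- TOPIC:", and "//" ending each entry. The function names parse_entries and select are made up for this illustration. The sample entry in the question has no TISSUE SPECIFICITY comment, so that topic simply comes back empty for it.

```python
import re

def parse_entries(lines):
    """Yield one dict per entry: {"id": ..., "gene": ..., "comments": {topic: text}}."""
    entry = {"id": None, "gene": None, "comments": {}}
    topic = None
    for line in lines:
        code, rest = line[:2], line[5:].rstrip()
        if code == "ID":
            entry["id"] = rest.split()[0]
        elif code == "GN" and entry["gene"] is None:
            # Keep only the primary gene name, e.g. "Name=Ywhab;"
            match = re.search(r"Name=([^;{]+)", rest)
            if match:
                entry["gene"] = match.group(1).strip()
        elif code == "CC":
            if rest.startswith("-!- "):
                # Start of a new comment topic, e.g. "-!- FUNCTION: ..."
                topic, _, text = rest[4:].partition(":")
                topic = topic.strip()
                previous = entry["comments"].get(topic, "")
                entry["comments"][topic] = (previous + " " + text.strip()).strip()
            elif rest.startswith("---"):
                # Dashed rule around the copyright notice: stop collecting
                topic = None
            elif topic is not None:
                # Continuation line of the current topic
                joined = entry["comments"][topic] + " " + rest.strip()
                entry["comments"][topic] = joined.strip()
        elif code == "//":
            yield entry
            entry = {"id": None, "gene": None, "comments": {}}
            topic = None

def select(lines, genes, topics):
    """Yield (entry id, {topic: text}) for entries whose gene name is in genes."""
    wanted = {g.lower() for g in genes}
    for entry in parse_entries(lines):
        if entry["gene"] and entry["gene"].lower() in wanted:
            yield entry["id"], {t: entry["comments"].get(t, "") for t in topics}

# A few lines copied from the entry in the question, for demonstration only.
SAMPLE = """\
ID   1433B_MOUSE             Reviewed;         246 AA.
GN   Name=Ywhab;
CC   -!- SUBCELLULAR LOCATION: Cytoplasm (By similarity). Melanosome (By
CC       similarity).
CC   -!- SIMILARITY: Belongs to the 14-3-3 family.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   -----------------------------------------------------------------------
//
"""

if __name__ == "__main__":
    topics = ["SUBCELLULAR LOCATION", "TISSUE SPECIFICITY"]
    for entry_id, found in select(SAMPLE.splitlines(), ["Ywhab"], topics):
        print(entry_id, found)
```

To run it on the real file, pass an open file handle instead of SAMPLE.splitlines(), for example select(open("myfile.dat"), my_genes, topics). Words hyphenated across lines in the flat file are joined with a space in this sketch, so treat the text as approximate.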


While you can use BioPerl to extract some information from a UniProtKB flat-file entry, more complete coverage of the entry structure can be found in Swissknife.

Depending on the information required, it may be easier to use the custom table export available on UniProt.org to get the selected data fields in an easier-to-process format. Alternatively, UniProtKB is also available from the UniProt BioMart or, if you want to use Java, via UniProtJAPI, a Java API to the UniProtKB data.
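As a rough illustration of why a table export is easier to process, the sketch below filters a tab-separated table down to a set of genes using only Python's standard library. The column headers ("Entry", "Gene names", "Subcellular location"), the function name filter_table, and the second sample row are assumptions made for this example; the real headers depend on which columns you select in the export.

```python
import csv
import io

def filter_table(handle, genes, columns, gene_column="Gene names"):
    """Yield the requested columns for rows whose gene column lists a wanted gene."""
    wanted = {g.lower() for g in genes}
    # QUOTE_NONE: treat quote characters in free-text fields as ordinary text
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        names = row[gene_column].lower().split()
        if wanted.intersection(names):
            yield {column: row[column] for column in columns}

# Hypothetical export; the headers and the second row are placeholders.
SAMPLE_TABLE = "\n".join([
    "Entry\tGene names\tSubcellular location",
    "Q9CQV8\tYwhab\tCytoplasm. Melanosome.",
    "X00000\tNotareal1\tplaceholder row",
]) + "\n"

if __name__ == "__main__":
    wanted_columns = ["Entry", "Subcellular location"]
    for row in filter_table(io.StringIO(SAMPLE_TABLE), ["Ywhab"], wanted_columns):
        print(row)
```

For a real export, replace io.StringIO(SAMPLE_TABLE) with open("export.tab", newline="") (file name hypothetical) and adjust gene_column to the header in your file.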

If you are happy working with XML or RDF/XML data, those formats are also available for UniProtKB data exported from UniProt.org, and they provide finer granularity and some additional information compared to the flat-file format.
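A minimal sketch of the XML route, again in standard-library Python. It assumes the export uses the namespace http://uniprot.org/uniprot and stores each annotation as a comment element with a type attribute and a text child; the sample document below is a heavily simplified, hand-written stand-in for a real export, and the function name comments_by_type is made up for this example. Check the element names against your own file before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the UniProt XML export; verify against your file.
NS = {"up": "http://uniprot.org/uniprot"}

def comments_by_type(xml_text, wanted_types):
    """Yield (accession, {comment type: [texts]}) for every entry in the document."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("up:entry", NS):
        # First accession element is taken as the primary accession
        accession = entry.findtext("up:accession", default="", namespaces=NS)
        found = {}
        for comment in entry.findall("up:comment", NS):
            comment_type = comment.get("type")
            if comment_type in wanted_types:
                text = comment.findtext("up:text", default="", namespaces=NS)
                found.setdefault(comment_type, []).append(text.strip())
        yield accession, found

# Simplified, hand-written sample; a real export carries many more elements.
SAMPLE_XML = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9CQV8</accession>
    <comment type="similarity">
      <text>Belongs to the 14-3-3 family.</text>
    </comment>
    <comment type="subunit">
      <text>Interacts with SAMSN1.</text>
    </comment>
  </entry>
</uniprot>
""".strip()

if __name__ == "__main__":
    for accession, found in comments_by_type(SAMPLE_XML, {"similarity"}):
        print(accession, found)
```

Because each annotation type is a separate element here, no line-continuation handling is needed, which is the practical benefit of the finer granularity.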


it is not working
