How to write a Perl script that prints the locus tag and name of a particular product?
6.3 years ago
bioinfo&me • 0

I have a whole-genome text file from which I want to extract only a particular product and the locus tag of that product. I have to write a Perl script for this. For example, I have a genome text file that looks like this:

LOCUS NC_000962 4411532 bp DNA linear CON 14-DEC-2017

DEFINITION Mycobacterium tuberculosis H37Rv, complete genome.

ACCESSION NC_000962

VERSION NC_000962.3

DBLINK BioProject: PRJNA57777 Assembly: GCF_000195955.2

KEYWORDS RefSeq; complete genome.

SOURCE Mycobacterium tuberculosis H37Rv

ORGANISM Mycobacterium tuberculosis H37Rv Bacteria; Actinobacteria; Corynebacteriales; Mycobacteriaceae; Mycobacterium; Mycobacterium tuberculosis complex.

REFERENCE 1

AUTHORS Lew,J.M., Kapopoulou,A., Jones,L.M. and Cole,S.T.

TITLE TubercuList--10 years after

JOURNAL Tuberculosis (Edinb) 91 (1), 1-7 (2011)

PUBMED 20980199

REFERENCE 2

AUTHORS Camus,J.C., Pryor,M.J., Medigue,C. and Cole,S.T.

TITLE Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv

JOURNAL Microbiology (Reading, Engl.) 148 (PT 10), 2967-2973 (2002)

PUBMED 12368430

REFERENCE 3

AUTHORS Cole,S.T., Brosch,R., Parkhill,J., Garnier,T., Churcher,C., Harris,D., Gordon,S.V., Eiglmeier,K., Gas,S., Barry,C.E. III, Tekaia,F., Badcock,K., Basham,D., Brown,D., Chillingworth,T., Connor,R., Davies,R., Devlin,K., Feltwell,T., Gentles,S., Hamlin,N., Holroyd,S., Hornsby,T., Jagels,K., Krogh,A., McLean,J., Moule,S., Murphy,L., Oliver,K., Osborne,J., Quail,M.A., Rajandream,M.A., Rogers,J., Rutter,S., Seeger,K., Skelton,J., Squares,R., Squares,S., Sulston,J.E., Taylor,K., Whitehead,S. and Barrell,B.G.

TITLE Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence

JOURNAL Nature 393 (6685), 537-544 (1998)

PUBMED 9634230

REMARK Erratum:[Nature 1998 Nov 12;396(6707):190]

REFERENCE 4 (bases 1 to 4411532)

CONSRTM NCBI Genome Project

TITLE Direct Submission

JOURNAL Submitted (06-FEB-2013) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA

REFERENCE 5 (bases 1 to 4411532)

AUTHORS Lew,J.M.

JOURNAL Submitted (18-DEC-2012) Lew J., Ecole Polytechnique Federale de Lausanne, CH-1015, Lausanne, Switzerland, and the Swiss Institute of Bioinformatics, CMU - Rue Michel-Servet 1, 1211 Geneva 4, SWITZERLAND

COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence is identical to AL123456. On Feb 6, 2013 this sequence version replaced NC_000962.2. RefSeq Category: Reference Genome FGS: First Genome sequenced MOD: Model Organism TYS: Designated Type Strain UPR: UniProt Genome Note: This annotation is from the TubercuList website, Release 26, Dec 2012 (URL: http://tuberculist.epfl.ch) (email: tuberculist@epfl.ch). COMPLETENESS: full length.

FEATURES         Location/Qualifiers

 source          1..4411532
                 /organism="Mycobacterium tuberculosis H37Rv"
                 /mol_type="genomic DNA"
                 /strain="H37Rv"
                 /db_xref="taxon:83332"

 gene            1..1524
                 /gene="dnaA"
                 /locus_tag="Rv0001"
                 /experiment="DESCRIPTION:Mutation analysis, gene expression[PMID: 10375628]"
                 /db_xref="GeneID:885041"

 CDS             1..1524
                 /experiment="EXISTENCE:Mass spectrometry[PMID:15525680]"
                 /experiment="EXISTENCE:Mass spectrometry[PMID:21085642]"
                 /experiment="EXISTENCE:Mass spectrometry[PMID:21920479]"
                 /inference="protein motif:PROSITE:PS00017"
                 /inference="protein motif:PROSITE:PS01008"
                 /codon_start=1
                 /transl_table=11
                 /product="chromosomal replication initiator protein DnaA"
                 /protein_id="NP_214515.1"
                 /translation="MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ
                 RAWLNLVQPLTIVEGFALLSVPSSFVQNEIERHLRAPITDALSRRLGHQIQLGVRIAP
                 PATDEADDTTVPPSENPATTSPDTTTDNDEIDDSAAARGDNQHSWPSYFTERPHNTDS
                 ATAGVTSLNRRYTFDTFVIGASNRFAHAAALAIAEAPARAYNPLFIWGESGLGKTHLL
                 HAAGNYAQRLFPGMRVKYVSTEEFTNDFINSLRDDRKVAFKRSYRDVDVLLVDDIQFI
                 EGKEGIQEEFFHTFNTLHNANKQIVISSDRPPKQLATLEDRLRTRFEWGLITDVQPPE
                 LETRIAILRKKAQMERLAVPDDVLELIASSIERNIRELEGALIRVTAFASLNKTPIDK
                 ALAEIVLRDLIADANTMQISAATIMAATAEYFDTTVEELRGPGKTRALAQSRQIAMYL
                 CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"

 gene            complement(77619..78896)
                 /gene="glyA2"
                 /locus_tag="Rv0070c"
                 /db_xref="GeneID:886983"

 CDS             complement(77619..78896)
                 characterization[PMID:12913008]"
                 /inference="protein motif:PROSITE:PS00096"
                 /note="Belongs to the ShmT family. Cofactor: pyridoxal
                 phosphate."
                 /product="serine hydroxymethyltransferase"
                 /protein_id="NP_214584.1"
                 /translation="MNTLNDSLTAFDPDIAALIDGELRRQESGLEMIASENYAPLAVM
                 QAQGSVLTNKYAEGYPGRRYYGGCEFVDGVEQLAIDRVKALFGAEYANVQPHSGATAN
                 AATMHALLNPGDTILGLSLAHGGHLTHGMRINFSGKLYHATAYEVSKEDYLVDMDAVA
                 EAARTHRPKMIIAGWSAYPRQLDFARFRAIADEVDAVLMVDMAHFAGLVAAGVHPSPV
                 PHAHVVTSTTHKTLGGPRGGIILCNDPAIAKKINSAVFPGQQGGPLEHVIAAKATAFK
                 MAAQPEFAQRQQRCLDGARILAGRLTQPDVAERGIAVLTGGTDVHLVLVDLRDAELDG
                 QQAEDRLAAVDITVNRNAVPFDPRPPMITSGLRIGTPALAARGFSHNDFRAVADLIAA
                 ALTATNDDQLGPLRAQVQRLAARYPLYPELHRT"

 gene            79486..80193
                 /locus_tag="Rv0071"
                 /db_xref="GeneID:886988"

 CDS             79486..80193
                 /note="group II intron maturase family. Contains 5 VDP
                 repeats at N-terminus."
                 /product="maturase"
                 /protein_id="NP_214585.1"
                 /translation="MSSITVSVDPVDPVDPVDPVDPVDAVVAAGSDGLTVARIESEIG
                 ALEFLNELRTELKSGQFRPQPVRERKIPKPGGLGKVRRLGIPTVADRVVQAALKLVLE
                 PIFETDFEPVSYGFRPARRAHDTIAEIHLFGTQEYRWVLDADIKACFDRIDHADLMDR
                 VRHRIKDKRVLRLVNWQRIRHRWNWTDVRRWLTDPTGRWHPISADGITLFNPAAVPIR
                 RYRYRGNTIPTPWTQAV"

 repeat_region   79507..79551
                 /note="5 x 9 bp GTGGACCCG repeats"
 repeat_region   80236..80550
                 /note="(MTV030.15), len: 315 nt. Probable REP'-1
                 pseudogene fragment"

By reading this file in Perl I can get all of its content and print every product name and locus tag, but I can't work out how to print the locus tag of only a particular product, i.e. one whose name contains "methyltransferase", like this: (/locus_tag="Rv0070c" - /product="serine hydroxymethyltransferase"). Can anyone help me solve this problem?

genome • 1.5k views

Would it be possible to share the Perl script (if not too long)?

Are you using BioPerl to read the EMBL/GenBank files?


I can probably help you in Python, but not in Perl. Why does it need to be Perl?

That said, you can use BioPerl in (almost) exactly the same way as Biopython.
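
As a starting point, a minimal plain-Perl sketch of the task in the question could look like the following. It is not BioPerl and not a full GenBank parser: it assumes that each /product qualifier comes after the /locus_tag of its own gene, as in the excerpt above, and it simply remembers the last locus tag seen. The subroutine name `find_products` and the default pattern are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return [locus_tag, product] pairs whose product matches $pattern.
# ASSUMPTION: every /product follows the /locus_tag of its own gene,
# as in the excerpt in the question. find_products is a hypothetical name.
sub find_products {
    my ($fh, $pattern) = @_;
    my ($locus, @hits);
    while (my $line = <$fh>) {
        # Remember the most recent locus tag.
        $locus = $1 if $line =~ /\/locus_tag="([^"]+)"/;

        # Look for a product qualifier; $2 is empty if the quote is still open.
        next unless $line =~ /\/product="([^"]*)("?)/;
        my ($product, $closed) = ($1, $2);

        # A long product name can wrap onto the following lines.
        while (!$closed and defined(my $more = <$fh>)) {
            $more =~ s/^\s+|\s+$//g;          # trim indentation and newline
            $closed = ($more =~ s/"$//);      # true once the closing quote is found
            $product .= " $more";
        }

        # Case-insensitive match against the requested pattern.
        push @hits, [ $locus // 'unknown', $product ]
            if $product =~ /$pattern/i;
    }
    return @hits;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my ($file, $pattern) = @ARGV;
    $pattern //= 'methyltransferase';         # default taken from the question
    open my $fh, '<', $file or die "Cannot open $file: $!";
    printf "%s\t%s\n", @$_ for find_products($fh, $pattern);
    close $fh;
}
```

Saved under a name of your choice (say `find_product.pl`, a hypothetical name), it would be called as `perl find_product.pl genome.gbk methyltransferase` and prints one tab-separated locus tag and product per matching line. For anything beyond a quick filter, a real parser such as BioPerl's Bio::SeqIO is the safer route, since it handles the feature table properly.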

