Obtain selected elements from RefSeq using awk and regex
6.6 years ago
theoharis ▴ 40

Suppose we have a text file such as a RefSeqGene record (see the example below). What is a suitable awk program (and regex) to create a new file with four columns (gene, synonym, note, summary), like this?

        gene="AP3B2"
        gene_synonym="EIEE48; NAPTB"
        note="adaptor related protein complex 3 beta 2 subunit"
        Summary= "Adaptor protein complex 3 (AP-3 complex) is a
        heterotrimeric protein complex involved in the formation of
        clathrin-coated synaptic vesicles. The protein encoded by this gene
        represents the beta subunit of the neuron-specific AP-3 complex and
        was first identified as the target antigen in human paraneoplastic
        neurologic disorders. The encoded subunit binds clathrin and is
        phosphorylated by a casein kinase-like protein, which mediates
        synaptic vesicle coat assembly. Defects in this gene are a cause of
        early-onset epileptic encephalopathy. [provided by RefSeq, Feb
        2017]."
  
> LOCUS       NG_052957              57628 bp    DNA     linear   PRI 02-MAR-2017
> DEFINITION  Homo sapiens adaptor related protein complex 3 beta 2 subunit
>             (AP3B2), RefSeqGene on chromosome 15.
> ACCESSION   NG_052957
> VERSION     NG_052957.1
> KEYWORDS    RefSeq; RefSeqGene.
> SOURCE      Homo sapiens (human)
>   ORGANISM  Homo sapiens
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
>             Catarrhini; Hominidae; Homo.
> COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
>             reference sequence was derived from AC105339.9 and FJ695193.1.
>             This sequence is a reference standard in the RefSeqGene project.

> 
>             Summary: Adaptor protein complex 3 (AP-3 complex) is a
>             heterotrimeric protein complex involved in the formation of
>             clathrin-coated synaptic vesicles. The protein encoded by this gene
>             represents the beta subunit of the neuron-specific AP-3 complex and
>             was first identified as the target antigen in human paraneoplastic
>             neurologic disorders. The encoded subunit binds clathrin and is
>             phosphorylated by a casein kinase-like protein, which mediates
>             synaptic vesicle coat assembly. Defects in this gene are a cause of
>             early-onset epileptic encephalopathy. [provided by RefSeq, Feb
>             2017].
> PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
>             1-35060             AC105339.9         88079-123138
>             35061-35259         FJ695193.1         1-199               c
>             35260-57628         AC105339.9         123337-145705
> FEATURES             Location/Qualifiers
>      source          1..57628
>                      /organism="Homo sapiens"
>                      /mol_type="genomic DNA"
>                      /db_xref="taxon:9606"
>                      /chromosome="15"
>                      /map="15q25.2"
>      gene            916..4438
>                      /gene="LOC338963"
>                      /note="epididymal protein pseudogene"
>                      /pseudo
>                      /db_xref="GeneID:338963"
>      misc_RNA        join(916..1179,2602..3348,3477..3722,4334..4438)
>                      /gene="LOC338963"
>                      /product="epididymal protein pseudogene"
>                      /exception="mismatches in transcription"
>                      /pseudo
>                      /transcript_id="NR_034139.1"
>                      /db_xref="GeneID:338963"
>      gene            5001..55628
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /note="adaptor related protein complex 3 beta 2 subunit"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>      mRNA            join(5001..5315,25456..25531,25677..25751,26078..26173,
>                      33329..33489,33731..33797,33890..34072,34154..34437,
>                      34680..34734,35109..35180,36742..36804,37106..37238,
>                      37526..37635,38272..38448,47976..48162,49334..49452,
>                      49606..49662,49966..50074,50419..50542,50934..51108,
>                      51289..51349,51676..51782,51987..52215,52657..52741,
>                      52987..53084,54926..55064,55199..55628)
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /product="adaptor related protein complex 3 beta 2
>                      subunit, transcript variant 1"
>                      /transcript_id="NM_001278512.1"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>      exon            5001..5315
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /inference="alignment:Splign:2.0.8"
>                      /number=1
>      CDS             join(5203..5315,25456..25531,25677..25751,26078..26173,
>                      33329..33489,33731..33797,33890..34072,34154..34437,
>                      34680..34734,35109..35180,36742..36804,37106..37238,
>                      37526..37635,38272..38448,47976..48162,49334..49452,
>                      49606..49662,49966..50074,50419..50542,50934..51108,
>                      51289..51349,51676..51782,51987..52215,52657..52741,
>                      52987..53084,54926..55064,55199..55349)
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /note="isoform 1 is encoded by transcript variant 1;
>                      Neuronal adaptin-like protein, beta-subunit; AP-3 complex
>                      subunit beta-2; beta-3B-adaptin; adaptor protein complex
>                      AP-3 subunit beta-2; neuron-specific vesicle coat protein
>                      beta-NAP; clathrin assembly protein complex 3 beta-2 large
>                      chain; adaptor-related protein complex 3 subunit beta-2"
>                      /codon_start=1
>                      /product="AP-3 complex subunit beta-2 isoform 1"
>                      /protein_id="NP_001265441.1"
>                      /db_xref="CCDS:CCDS61737.1"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>                      /translation="MSAAPAYSEDKGGSAGPGEPEYGHDPASGGIFSSDYKRHDDLKE
>                      MLDTNKDSLKLEAMKRIVAMIARGKNASDLFPAVVKNVACKNIEVKKLVYVYLVRYAE
>                      EQQDLALLSISTFQRGLKDPNQLIRASALRVLSSIRVPIIVPIMMLAIKEAASDMSPY
>                      VRKTAAHAIPKLYSLDSDQKDQLIEVIEKLLADKTTLVAGSVVMAFEEVCPERIDLIH
>                      KNYRKLCNLLIDVEEWGQVVIISMLTRYARTQFLSPTQNESLLEENAEKAFYGSEEDE
>                      AKGAGSEETAAAAAPSRKPYVMDPDHRLLLRNTKPLLQSRSAAVVMAVAQLYFHLAPK
>                      AEVGVIAKALVRLLRSHSEVQYVVLQNVATMSIKRRGMFEPYLKSFYIRSTDPTQIKI
>                      LKLEVLTNLANETNIPTVLREFQTYIRSMDKDFVAATIQAIGRCATNIGRVRDTCLNG
>                      LVQLLSNRDELVVAESVVVIKKLLQMQPAQHGEIIKHLAKLTDNIQVPMARASILWLI
>                      GEYCEHVPRIAPDVLRKMAKSFTAEEDIVKLQVINLAAKLYLTNSKQTKLLTQYVLSL
>                      AKYDQNYDIRDRARFTRQLIVPSEQGGALSRHAKKLFLAPKPAPVLESSFKDRDHFQL
>                      GSLSHLLNAKATGYQELPDWPEEAPDPSVRNVEEEDLSLIETHVGLLGEYTEVPEWTK
>                      CSNREKRKEKEKPFYSDSEGESGPTESADSDPESESESDSKSSSESGSGESSSESDNE
>                      DQDEDEEKGRGSESEQSEEDGKRKTKKKVPERKGEASSSDEGSDSSSSSSESEMTSES
>                      EEEQLEPASWSRKTPPSSKSAPATKEISLLDLEDFTPPSVQPVSPPAIVSTSLAADLE
>                      GLTLTDSTLVPSLLSPVSGVGRQELLHRVAGEGLAVDYTFSRQPFSGDPHMVSVHIHF
>                      SNSSDTPIKGLHVGTPKLPAGISIQEFPEIESLAPGESATAVMGINFCDSTQAANFQL
>                      CTQTRQFYVSIQPPVGELMAPVFMSENEFKKEQGKLMGMNEITEKLMLPDTCRSDHIV
>                      VQKVTATANLGRVPCGTSDEYRFAGRTLTGGSLVLLTLDARPAGAAQLTVNSEKMVIG
>                      TMLVKDVIQALTQ"
>      gene            complement(22089..>57628)
>                      /gene="CPEB1-AS1"
>                      /note="CPEB1 antisense RNA 1"
>                      /db_xref="GeneID:283692"
>                      /db_xref="HGNC:HGNC:27523"
>      ncRNA           complement(22089..>22898)
>                      /ncRNA_class="lncRNA"
>                      /gene="CPEB1-AS1"
>                      /product="CPEB1 antisense RNA 1"
>                      /inference="similar to RNA sequence (same
>                      species):RefSeq:NR_046096.1"
>                      /exception="annotated by transcript or proteomic data"
>                      /transcript_id="NR_046096.1"
>                      /db_xref="GeneID:283692"
>                      /db_xref="HGNC:HGNC:27523"
>      gene            22457..23383
>                      /gene="LOC100421235"
>                      /note="serine and arginine rich splicing factor 9
>                      pseudogene"
>                      /pseudo
>                      /db_xref="GeneID:100421235"
>      exon            25456..25531
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /inference="alignment:Splign:2.0.8"
>                      /number=2
> ORIGIN
>         1 gtccccatgg ggtgggtggc atgatcaggc caggtgcccc aggagtggga gtctctgttc
>        61 cctgggctct tacagctcca gggccttgcc cccttttctt tcttacaaag aaaacggtgg
>       121 cttgactcag caaaaactaa gaagggtagc tgtttctcca ggtcaggaag gatacggggg
>       181 tcagcacttc ctggcagttg agtctgggga agggggacct cacatgccag cagcgtgaga
>       241 aagatgatac tgtacagtgg tgaaggacac gggcactgga gccagaccac ttggcctgaa
>       301 tactggttgt gccgcttacc agcttgtaac ctctccaagc ctcagtttcc ccatctgtaa
>       361 aatgggaagt ataacatcat ctacttcaag tcattattgt tagggctaaa tgatgcttta
> //

Take a look at parsing GenBank files with Biopython.


The question was about using awk on the RefSeqGene file format :)


Have you tried an awk or refseq command? If so, post it, and any errors/outputs you're getting. Then, someone may be able to help. You are asking for a service to be done here, i.e. a usable command to be provided, without input from you as the OP, and that is not the purpose of the site. Given your file type, I provided a suggestion, to get you started in formulating your own code that will parse your file in the manner that you wish.
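
For reference, here is a minimal awk sketch of the extraction asked about in the question. It is one possible approach, not a tested production parser. It assumes the record keeps the standard GenBank flat-file layout (top-level keywords in column 1, feature keys in column 6, qualifiers indented below them) and that the summary is the COMMENT paragraph beginning with `Summary:`. The function name `extract_gene_table` and the tab-separated output are illustrative choices. It prints one row per `gene` feature and repeats the record-level summary in the fourth column.

```shell
# extract_gene_table is a hypothetical helper name, not an existing tool.
# It reads a GenBank-format file (or stdin if no file is given) and writes TSV.
extract_gene_table() {
    awk '
    BEGIN { OFS = "\t" }

    # Remove leading and trailing blanks from a string.
    function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }

    # Print the gene feature collected so far (if any) and reset its state.
    function emit_gene() {
        if (in_gene) print q["gene"], q["gene_synonym"], q["note"], summary
        in_gene = 0; cur = ""
        split("", q)                      # portable way to empty an array
    }

    { sub(/\r$/, "") }                    # tolerate Windows line endings

    # Top-level keyword in column 1 (LOCUS, COMMENT, FEATURES, ORIGIN, //).
    /^[^ ]/ {
        emit_gene()
        section = $1; in_summary = 0
        if (section == "LOCUS") summary = ""   # new record, new summary
    }

    # COMMENT block: collect the paragraph that starts with "Summary:".
    section == "COMMENT" {
        text = trim(substr($0, 13))       # comment text starts in column 13
        if (text ~ /^Summary:/) {
            in_summary = 1
            sub(/^Summary:[ ]*/, "", text)
            summary = text
        } else if (in_summary && text != "") {
            summary = summary " " text    # wrapped line of the same paragraph
        } else {
            in_summary = 0                # blank line ends the paragraph
        }
        next
    }

    section == "FEATURES" {
        if ($0 ~ /^     [^ ]/) {          # feature key starts in column 6
            emit_gene()
            in_gene = ($1 == "gene")      # only "gene" features are reported
            next
        }
        if (!in_gene) next
        line = trim($0)
        if (cur != "") {                  # continuation of a wrapped value
            q[cur] = q[cur] " " line
        } else if (line ~ /^\/(gene|gene_synonym|note)="/) {
            cur = substr(line, 2, index(line, "=") - 2)    # qualifier name
            q[cur] = substr(line, index(line, "=") + 2)    # text after ="
        } else {
            next                          # qualifier we do not track
        }
        if (q[cur] ~ /"$/) {              # closing quote: value is complete
            sub(/"$/, "", q[cur])
            cur = ""
        }
        next
    }

    END { emit_gene() }
    ' "$@"
}
```

It could be called as `extract_gene_table record.gb > genes.tsv` (the file names are placeholders). Because every `gene` feature in the record gets a row, a filter on the first column, for example `awk -F'\t' '$1 == "AP3B2"'`, would keep only the gene of interest. Values containing doubled quotes at a line end are not handled by this sketch.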
