Question: obtain selected elements from refseq using awk and regex
2.2 years ago
theoharis • 0

Suppose we have a text file in the RefSeqGene (GenBank flat file) format, such as the example record below. What would be a suitable awk program (and regex) to create a new file with four columns (gene, synonym, note, summary), like this:

        gene="AP3B2"
        gene_synonym="EIEE48; NAPTB"
        note="adaptor related protein complex 3 beta 2 subunit"
        Summary= "Adaptor protein complex 3 (AP-3 complex) is a
        heterotrimeric protein complex involved in the formation of
        clathrin-coated synaptic vesicles. The protein encoded by this gene
        represents the beta subunit of the neuron-specific AP-3 complex and
        was first identified as the target antigen in human paraneoplastic
        neurologic disorders. The encoded subunit binds clathrin and is
        phosphorylated by a casein kinase-like protein, which mediates
        synaptic vesicle coat assembly. Defects in this gene are a cause of
        early-onset epileptic encephalopathy. [provided by RefSeq, Feb
        2017]."
  
> LOCUS       NG_052957              57628 bp    DNA     linear   PRI 02-MAR-2017
> DEFINITION  Homo sapiens adaptor related protein complex 3 beta 2 subunit
>             (AP3B2), RefSeqGene on chromosome 15.
> ACCESSION   NG_052957
> VERSION     NG_052957.1
> KEYWORDS    RefSeq; RefSeqGene.
> SOURCE      Homo sapiens (human)
>   ORGANISM  Homo sapiens
>             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>             Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
>             Catarrhini; Hominidae; Homo.
> COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
>             reference sequence was derived from AC105339.9 and FJ695193.1.
>             This sequence is a reference standard in the RefSeqGene project.
>
>             Summary: Adaptor protein complex 3 (AP-3 complex) is a
>             heterotrimeric protein complex involved in the formation of
>             clathrin-coated synaptic vesicles. The protein encoded by this gene
>             represents the beta subunit of the neuron-specific AP-3 complex and
>             was first identified as the target antigen in human paraneoplastic
>             neurologic disorders. The encoded subunit binds clathrin and is
>             phosphorylated by a casein kinase-like protein, which mediates
>             synaptic vesicle coat assembly. Defects in this gene are a cause of
>             early-onset epileptic encephalopathy. [provided by RefSeq, Feb
>             2017].
> PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
>             1-35060             AC105339.9         88079-123138
>             35061-35259         FJ695193.1         1-199               c
>             35260-57628         AC105339.9         123337-145705
> FEATURES             Location/Qualifiers
>      source          1..57628
>                      /organism="Homo sapiens"
>                      /mol_type="genomic DNA"
>                      /db_xref="taxon:9606"
>                      /chromosome="15"
>                      /map="15q25.2"
>      gene            916..4438
>                      /gene="LOC338963"
>                      /note="epididymal protein pseudogene"
>                      /pseudo
>                      /db_xref="GeneID:338963"
>      misc_RNA        join(916..1179,2602..3348,3477..3722,4334..4438)
>                      /gene="LOC338963"
>                      /product="epididymal protein pseudogene"
>                      /exception="mismatches in transcription"
>                      /pseudo
>                      /transcript_id="NR_034139.1"
>                      /db_xref="GeneID:338963"
>      gene            5001..55628
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /note="adaptor related protein complex 3 beta 2 subunit"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>      mRNA            join(5001..5315,25456..25531,25677..25751,26078..26173,
>                      33329..33489,33731..33797,33890..34072,34154..34437,
>                      34680..34734,35109..35180,36742..36804,37106..37238,
>                      37526..37635,38272..38448,47976..48162,49334..49452,
>                      49606..49662,49966..50074,50419..50542,50934..51108,
>                      51289..51349,51676..51782,51987..52215,52657..52741,
>                      52987..53084,54926..55064,55199..55628)
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /product="adaptor related protein complex 3 beta 2
>                      subunit, transcript variant 1"
>                      /transcript_id="NM_001278512.1"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>      exon            5001..5315
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /inference="alignment:Splign:2.0.8"
>                      /number=1
>      CDS             join(5203..5315,25456..25531,25677..25751,26078..26173,
>                      33329..33489,33731..33797,33890..34072,34154..34437,
>                      34680..34734,35109..35180,36742..36804,37106..37238,
>                      37526..37635,38272..38448,47976..48162,49334..49452,
>                      49606..49662,49966..50074,50419..50542,50934..51108,
>                      51289..51349,51676..51782,51987..52215,52657..52741,
>                      52987..53084,54926..55064,55199..55349)
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /note="isoform 1 is encoded by transcript variant 1;
>                      Neuronal adaptin-like protein, beta-subunit; AP-3 complex
>                      subunit beta-2; beta-3B-adaptin; adaptor protein complex
>                      AP-3 subunit beta-2; neuron-specific vesicle coat protein
>                      beta-NAP; clathrin assembly protein complex 3 beta-2 large
>                      chain; adaptor-related protein complex 3 subunit beta-2"
>                      /codon_start=1
>                      /product="AP-3 complex subunit beta-2 isoform 1"
>                      /protein_id="NP_001265441.1"
>                      /db_xref="CCDS:CCDS61737.1"
>                      /db_xref="GeneID:8120"
>                      /db_xref="HGNC:HGNC:567"
>                      /db_xref="MIM:602166"
>                      /translation="MSAAPAYSEDKGGSAGPGEPEYGHDPASGGIFSSDYKRHDDLKE
>                      MLDTNKDSLKLEAMKRIVAMIARGKNASDLFPAVVKNVACKNIEVKKLVYVYLVRYAE
>                      EQQDLALLSISTFQRGLKDPNQLIRASALRVLSSIRVPIIVPIMMLAIKEAASDMSPY
>                      VRKTAAHAIPKLYSLDSDQKDQLIEVIEKLLADKTTLVAGSVVMAFEEVCPERIDLIH
>                      KNYRKLCNLLIDVEEWGQVVIISMLTRYARTQFLSPTQNESLLEENAEKAFYGSEEDE
>                      AKGAGSEETAAAAAPSRKPYVMDPDHRLLLRNTKPLLQSRSAAVVMAVAQLYFHLAPK
>                      AEVGVIAKALVRLLRSHSEVQYVVLQNVATMSIKRRGMFEPYLKSFYIRSTDPTQIKI
>                      LKLEVLTNLANETNIPTVLREFQTYIRSMDKDFVAATIQAIGRCATNIGRVRDTCLNG
>                      LVQLLSNRDELVVAESVVVIKKLLQMQPAQHGEIIKHLAKLTDNIQVPMARASILWLI
>                      GEYCEHVPRIAPDVLRKMAKSFTAEEDIVKLQVINLAAKLYLTNSKQTKLLTQYVLSL
>                      AKYDQNYDIRDRARFTRQLIVPSEQGGALSRHAKKLFLAPKPAPVLESSFKDRDHFQL
>                      GSLSHLLNAKATGYQELPDWPEEAPDPSVRNVEEEDLSLIETHVGLLGEYTEVPEWTK
>                      CSNREKRKEKEKPFYSDSEGESGPTESADSDPESESESDSKSSSESGSGESSSESDNE
>                      DQDEDEEKGRGSESEQSEEDGKRKTKKKVPERKGEASSSDEGSDSSSSSSESEMTSES
>                      EEEQLEPASWSRKTPPSSKSAPATKEISLLDLEDFTPPSVQPVSPPAIVSTSLAADLE
>                      GLTLTDSTLVPSLLSPVSGVGRQELLHRVAGEGLAVDYTFSRQPFSGDPHMVSVHIHF
>                      SNSSDTPIKGLHVGTPKLPAGISIQEFPEIESLAPGESATAVMGINFCDSTQAANFQL
>                      CTQTRQFYVSIQPPVGELMAPVFMSENEFKKEQGKLMGMNEITEKLMLPDTCRSDHIV
>                      VQKVTATANLGRVPCGTSDEYRFAGRTLTGGSLVLLTLDARPAGAAQLTVNSEKMVIG
>                      TMLVKDVIQALTQ"
>      gene            complement(22089..>57628)
>                      /gene="CPEB1-AS1"
>                      /note="CPEB1 antisense RNA 1"
>                      /db_xref="GeneID:283692"
>                      /db_xref="HGNC:HGNC:27523"
>      ncRNA           complement(22089..>22898)
>                      /ncRNA_class="lncRNA"
>                      /gene="CPEB1-AS1"
>                      /product="CPEB1 antisense RNA 1"
>                      /inference="similar to RNA sequence (same
>                      species):RefSeq:NR_046096.1"
>                      /exception="annotated by transcript or proteomic data"
>                      /transcript_id="NR_046096.1"
>                      /db_xref="GeneID:283692"
>                      /db_xref="HGNC:HGNC:27523"
>      gene            22457..23383
>                      /gene="LOC100421235"
>                      /note="serine and arginine rich splicing factor 9
>                      pseudogene"
>                      /pseudo
>                      /db_xref="GeneID:100421235"
>      exon            25456..25531
>                      /gene="AP3B2"
>                      /gene_synonym="EIEE48; NAPTB"
>                      /inference="alignment:Splign:2.0.8"
>                      /number=2
> ORIGIN
>         1 gtccccatgg ggtgggtggc atgatcaggc caggtgcccc aggagtggga gtctctgttc
>        61 cctgggctct tacagctcca gggccttgcc cccttttctt tcttacaaag aaaacggtgg
>       121 cttgactcag caaaaactaa gaagggtagc tgtttctcca ggtcaggaag gatacggggg
>       181 tcagcacttc ctggcagttg agtctgggga agggggacct cacatgccag cagcgtgaga
>       241 aagatgatac tgtacagtgg tgaaggacac gggcactgga gccagaccac ttggcctgaa
>       301 tactggttgt gccgcttacc agcttgtaac ctctccaagc ctcagtttcc ccatctgtaa
>       361 aatgggaagt ataacatcat ctacttcaag tcattattgt tagggctaaa tgatgcttta
> //

Take a look at parsing GenBank files with Biopython.

2.2 years ago
st.ph.n
♦ 2.5k

The question was about using awk on the RefSeqGene file format :)

2.2 years ago
theoharis
• 0

Have you tried an awk command or a regex yourself? If so, post it along with any errors or output you are getting, and someone may be able to help. As it stands, you are asking for a service, i.e. a ready-to-use command, without any input from you as the OP, and that is not the purpose of the site. Given your file type, I offered a suggestion to get you started on writing your own code to parse the file the way you want.

2.2 years ago
st.ph.n
♦ 2.5k
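
For anyone landing on this thread later, here is a minimal awk sketch of the kind of parser the question asks for. It is only a starting point and rests on several assumptions that are not confirmed anywhere in the thread: the input is a well-formed GenBank flat file as downloaded from NCBI (one keyword per line, no `> ` quoting prefix), the gene of interest is the symbol in parentheses before ", RefSeqGene" on the DEFINITION line, the Summary paragraph in COMMENT ends at a blank line or at the next top-level keyword, and the output is tab-separated. The file names `sample.gb` and `refseq_columns.tsv` are made up for illustration, and the sample record is abridged from the one in the question.

```shell
#!/bin/sh
# Abridged stand-in for the RefSeqGene record in the question (NG_052957).
cat > sample.gb <<'EOF'
LOCUS       NG_052957              57628 bp    DNA     linear   PRI 02-MAR-2017
DEFINITION  Homo sapiens adaptor related protein complex 3 beta 2 subunit
            (AP3B2), RefSeqGene on chromosome 15.
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.

            Summary: Adaptor protein complex 3 (AP-3 complex) is a
            heterotrimeric protein complex involved in the formation of
            clathrin-coated synaptic vesicles. [provided by RefSeq, Feb
            2017].
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
FEATURES             Location/Qualifiers
     gene            916..4438
                     /gene="LOC338963"
                     /note="epididymal protein pseudogene"
                     /pseudo
     gene            5001..55628
                     /gene="AP3B2"
                     /gene_synonym="EIEE48; NAPTB"
                     /note="adaptor related protein complex 3 beta 2 subunit"
                     /db_xref="GeneID:8120"
     CDS             5203..5315
                     /gene="AP3B2"
                     /note="isoform 1 is encoded by transcript variant 1"
ORIGIN
        1 gtccccatgg
//
EOF

awk '
function trim(s) { sub(/^[ \t]+/, "", s); sub(/[ \t]+$/, "", s); return s }

# Print the gene feature just finished, but only for the gene named in DEFINITION.
function flush_gene() {
    if (in_gene && target != "" && q["gene"] == target)
        printf "%s\t%s\t%s\t%s\n", q["gene"], q["gene_synonym"], q["note"], summary
    in_gene = 0; cur = ""
    split("", q)                      # portable way to empty an array
}

$1 == "//" {                          # end of record: reset per-record state
    flush_gene(); section = def = summary = target = ""; in_sum = 0; next
}

/^[A-Z]/ {                            # top-level keyword in column 1
    flush_gene(); section = $1; in_sum = 0
    if (section == "DEFINITION") { def = $0; sub(/^DEFINITION[ \t]+/, "", def) }
    # Once FEATURES starts, DEFINITION is complete: take "(SYMBOL), RefSeqGene".
    if (section == "FEATURES" && match(def, /\([^()]+\), RefSeqGene/)) {
        target = substr(def, RSTART + 1); sub(/\).*/, "", target)
    }
    next
}

section == "DEFINITION" { def = def " " trim($0); next }

section == "COMMENT" {                # Summary runs until a blank line or the next keyword
    line = trim($0)
    if (line ~ /^Summary:/) { in_sum = 1; sub(/^Summary:[ \t]*/, "", line); summary = line }
    else if (line == "")    in_sum = 0
    else if (in_sum)        summary = summary " " line
    next
}

section == "FEATURES" {
    if ($0 ~ /^     [^ ]/) {          # feature key line (5-space indent)
        flush_gene(); in_gene = ($1 == "gene"); next
    }
    if (!in_gene) next                # only qualifiers of gene features are kept
    line = trim($0)
    if (line ~ /^\//) {               # new qualifier, e.g. /gene="AP3B2"
        cur = line; sub(/^\//, "", cur); sub(/=.*/, "", cur)
        if (line ~ /=/) sub(/^[^=]*=/, "", line); else line = ""
        gsub(/"/, "", line); q[cur] = line
    } else if (cur != "") {           # wrapped value continues the current qualifier
        gsub(/"/, "", line); q[cur] = q[cur] " " line
    }
}

END { flush_gene() }
' sample.gb > refseq_columns.tsv

cat refseq_columns.tsv
```

On the abridged sample this writes a single tab-separated row for AP3B2. Restricting the qualifiers to `gene` features matters because `mRNA`, `exon` and `CDS` features repeat `/gene` and carry their own, different `/note` values. For anything beyond a quick extraction, a dedicated GenBank parser such as the one suggested above is likely to be more robust.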
