Question

Extracting genomic coordinate, feature type and strand for a gene list

5.6 years ago

seta ★ 1.9k

Hi all,

I have a list of about 5000 genes (gene name and the corresponding Entrez Gene ID). I would like to extract the chromosome, genomic coordinates, feature type (promoter, gene, transcript, exon, CDS, UTR, start_codon, stop_codon) and strand for every gene in this list. Could you please help me with this? Please share any tool or command that would do it.

Thanks

gene list genomic coordinate feature type • 1.4k views

ADD COMMENT • link 5.6 years ago by seta ★ 1.9k

Try BioMart or the APIs of the major repositories.

ADD REPLY • link 5.6 years ago by cpad0112 21k

Except for the promoter region, all of this information is available from a GTF file for your species. Assuming you are working with human, download the current GTF, e.g. from GENCODE, and grep/zgrep for the respective gene names. From there, you can further subset for the features you want. For reference, a GTF line has nine tab-separated columns: chromosome, source, feature type, start, end, score, strand, frame and attributes.

Example for the gene CEBPA (subset):

zgrep -w 'CEBPA' gencode.v28.annotation.gtf.gz
chr19   HAVANA  gene    33299934    33302564    .   -   .   gene_id "ENSG00000245848.2"; gene_type "protein_coding"; gene_name "CEBPA"; level 2; havana_gene "OTTHUMG00000161461.1";
chr19   HAVANA  transcript  33299934    33302564    .   -   .   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  exon    33299934    33302564    .   -   .   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  CDS 33301341    33302414    .   -   0   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  start_codon 33302412    33302414    .   -   0   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  stop_codon  33301338    33301340    .   -   0   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  UTR 33299934    33301340    .   -   .   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
chr19   HAVANA  UTR 33302415    33302564    .   -   .   gene_id "ENSG00000245848.2"; transcript_id "ENST00000498907.2"; gene_type "protein_coding"; gene_name "CEBPA"; transcript_type "protein_coding"; transcript_name "CTD-2540B15.2-001"; exon_number 1; exon_id "ENSE00001973852.2"; level 2; protein_id "ENSP00000427514.1"; transcript_support_level "NA"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS54243.1"; havana_gene "OTTHUMG00000161461.1"; havana_transcript "OTTHUMT00000365012.1";
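The subsetting step mentioned above could look roughly like the following. This is only a sketch: the file `toy.gtf` and the function name `gtf_to_table` are made up for illustration, and the two records are shortened copies of the CEBPA lines shown above. It prints chromosome, start, end, feature type, strand and gene name as tab-separated columns.

```shell
# Work in a scratch directory so nothing is left behind in the current one.
workdir=$(mktemp -d)

# Toy GTF: two of the CEBPA records from above, with shortened attributes.
printf 'chr19\tHAVANA\tgene\t33299934\t33302564\t.\t-\t.\tgene_id "ENSG00000245848.2"; gene_name "CEBPA";\n' > "$workdir/toy.gtf"
printf 'chr19\tHAVANA\tCDS\t33301341\t33302414\t.\t-\t0\tgene_id "ENSG00000245848.2"; gene_name "CEBPA";\n' >> "$workdir/toy.gtf"

# Hypothetical helper: chromosome, start, end, feature, strand, gene name.
gtf_to_table() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ {
         name = "NA"
         if (match($9, /gene_name "[^"]+"/)) {
           # offset 11 = length of the text gene_name plus space plus opening quote
           name = substr($9, RSTART + 11, RLENGTH - 12)
         }
         print $1, $4, $5, $3, $7, name
       }' "$@"
}

table=$(gtf_to_table "$workdir/toy.gtf")
printf '%s\n' "$table"
```

On a real annotation, the same function should work on the output of the zgrep call, e.g. `zgrep -w 'CEBPA' gencode.v28.annotation.gtf.gz | gtf_to_table`. To keep only some feature types, a condition such as `$3 == "exon"` can be added to the awk pattern.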

For the promoter region, I am not sure whether dedicated databases exist. For simplicity, taking the 250 bp upstream of the first exon sounds like a reasonable approach to me.
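A strand-aware way to compute such an upstream window might look like this. It is a sketch under some assumptions: it uses the `gene` records, so it takes the 5' end of the gene as the reference point (individual transcripts can start elsewhere), the function name `promoters` and the plus-strand gene `TOYGENE` are invented for illustration, and coordinates are 1-based and inclusive as in the GTF.

```shell
workdir=$(mktemp -d)

# CEBPA gene record from above (minus strand) plus a made-up plus-strand gene.
printf 'chr19\tHAVANA\tgene\t33299934\t33302564\t.\t-\t.\tgene_id "ENSG00000245848.2"; gene_name "CEBPA";\n' > "$workdir/toy_genes.gtf"
printf 'chrT\ttoy\tgene\t1000\t2000\t.\t+\t.\tgene_id "TOY1"; gene_name "TOYGENE";\n' >> "$workdir/toy_genes.gtf"

# Hypothetical helper: promoters FILE [UPSTREAM_BP], default window 250 bp.
promoters() {
  awk -v up="${2:-250}" 'BEGIN { FS = OFS = "\t" }
    $3 == "gene" {
      if ($7 == "+") { s = $4 - up; e = $4 - 1 }   # upstream lies before the start
      else           { s = $5 + 1;  e = $5 + up }  # minus strand: upstream lies after the end
      if (s < 1) s = 1                             # do not run off the chromosome start
      print $1, s, e, "promoter", $7
    }' "$1"
}

promoter_table=$(promoters "$workdir/toy_genes.gtf" 250)
printf '%s\n' "$promoter_table"
```

The window size is the second argument, so a different width can be tried without touching the awk code.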

ADD REPLY • link 5.5 years ago by ATpoint 81k

Thanks for your response. Could you please tell me if zgrep can get the gene list? Regarding the promoter region, I think about 1000 bp upstream of the transcription start site (TSS) is OK. Any suggestions?

ADD REPLY • link 5.5 years ago by seta ★ 1.9k

I do not understand what you mean by 'get the gene list'. Please explain. 1 kb is quite big, too big for my taste. A typical ATAC-seq peak (open chromatin) at a promoter is about 500 bp wide. If you only want the nucleosome-free region, it is about 200 bp. It depends on your goal, but if you plan to check for motif enrichment, a smaller region is better than a larger one.

ADD REPLY • link 5.5 years ago by ATpoint 81k

I have a gene list containing about 5000 gene names with Entrez Gene IDs. What I mean is: is it possible to get the required information for the whole gene list with zgrep at once, instead of typing one gene name at a time as in your example? Thank you for your point about the promoter region; I would like to examine the variants in this region.

ADD REPLY • link 5.5 years ago by seta ★ 1.9k

Yes, grep (or its companion zgrep, which searches gzipped files) can search for multiple patterns, using the -E option with patterns separated by |, or the -f option to read the patterns from a file. Please spend some quality time learning the basics of the Unix command line, especially awk and grep. You'll find plenty of tutorials on the web. I understand that you would probably prefer a ready-to-use script, but depending on what you want to do with the output, you'll need further commands for subsequent filtering. It is therefore very advisable to first get some background in the Unix tools. You'll need that all the time :-)
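For a list of thousands of genes, reading the patterns from a file with grep's `-f` option is probably easier than one long `-E` expression. The sketch below assumes a file with one gene symbol per line and matches on the `gene_name` attribute, since that is what the GTF lines above carry. All gene names, coordinates and file names here are made up. Each symbol is wrapped into the exact attribute text first, because a plain `-w` match on `GENEA` would also hit `GENEA-AS1` (the hyphen counts as a word boundary).

```shell
workdir=$(mktemp -d)

# Hypothetical gene list, one symbol per line.
printf 'GENEA\nGENEC\n' > "$workdir/genes.txt"

# Toy gzipped GTF with four made-up genes.
toy_line() {
  printf 'chrT\ttoy\tgene\t%s\t%s\t.\t+\t.\tgene_id "%s"; gene_name "%s";\n' "$1" "$2" "$3" "$3"
}
{
  toy_line 100 200 GENEA
  toy_line 150 250 GENEA-AS1
  toy_line 300 400 GENEB
  toy_line 500 600 GENEC
} | gzip -c > "$workdir/toy.gtf.gz"

# Turn each symbol into the exact attribute text: gene_name "SYMBOL";
sed 's/.*/gene_name "&";/' "$workdir/genes.txt" > "$workdir/patterns.txt"

# -F: treat patterns as fixed strings, -f: read one pattern per line from a file.
hits=$(gzip -dc "$workdir/toy.gtf.gz" | grep -F -f "$workdir/patterns.txt")
printf '%s\n' "$hits"
```

Here `gzip -dc file | grep ...` stands in for zgrep. The result can be piped into awk for further filtering by feature type or column.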

ADD REPLY • link 5.5 years ago by ATpoint 81k