Question

Protein coding mm10 refseq bed

2


5.5 years ago

rbronste ▴ 420

I'm trying to export a BED file from the UCSC Table Browser with protein-coding gene body locations in mm10, containing the following columns:

chr start end NA genename NMname strand

Not sure if there is a more straightforward way to get this arrangement, thanks!

refseq bed mm10 • 3.5k views

updated 5.5 years ago by vkkodali_ncbi ★ 3.7k • written 5.5 years ago by rbronste ▴ 420

score 2 · Answer 1 · 2018-10-25


5.5 years ago

Arup Ghosh 3.2k

In the Table Browser, set Output format to "selected fields from primary and related tables" and click "get output", then choose the columns you need on the selection page.

Link to table browser

Table Browser

Select columns:

Selection Page


score 0 · Answer 2 · 2018-10-25

If you are interested in RefSeq data, why not download the GFF3 annotation from NCBI and parse that file? You can download the GFF3 file from RefSeq FTP site here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/GFF_interim/interim_GRCm38.p6_top_level_2017-09-26.gff3.gz

A gene can be protein-coding and yet have one or more non-coding transcript variants. Hence, you first need the list of GeneIDs that encode at least one protein, i.e. those that appear on at least one CDS row. You can get it by parsing the GFF3 file as follows:

zgrep -v '^#' interim_GRCm38.p6_top_level_2017-09-26.gff3.gz | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' | grep -o 'GeneID:[0-9]*' | sort -u > ~/GRCm38.p6_protein_coding_genes.txt

Then, you can grep for those GeneIDs in the GFF3 file, restricted to rows where column 3 is gene, to get the full range and strand of each gene. It is unclear to me whether you want the range of each gene or of each transcript variant (one of your columns is an NM accession). Depending on exactly what you want, it is fairly easy to write a Unix command that parses the GFF3 file and returns a BED-style file.
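For the gene-level case, the two steps could be folded into a single awk pass, roughly as sketched below. This is only a sketch under a few assumptions: gff3_coding_genes_to_bed is a made-up helper name, gene rows are assumed to carry Dbxref=GeneID:... and Name=... attributes, and column 1 is printed exactly as it appears in the file (RefSeq accessions such as NC_000067.6 would still need mapping to chr names if you want UCSC-style output).

```shell
# Hypothetical helper: print BED6-style rows (chrom, start, end, name, ".", strand)
# for every gene whose GeneID also appears on at least one CDS row.
gff3_coding_genes_to_bed() {
    gzip -dc "$1" | awk '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                      # skip header and directive lines
        {
            # Pull "GeneID:<digits>" out of the attributes column, if present.
            id = ""
            if (match($9, /GeneID:[0-9]+/)) id = substr($9, RSTART, RLENGTH)
        }
        $3 == "CDS" && id != "" { coding[id] = 1 }
        $3 == "gene" && id != "" {
            # Assumption: the gene symbol is stored in a Name= attribute.
            name = "."
            n = split($9, attr, ";")
            for (i = 1; i <= n; i++)
                if (attr[i] ~ /^Name=/) name = substr(attr[i], 6)
            # GFF3 is 1-based inclusive, BED is 0-based half-open: only the start shifts.
            ngenes++
            gene_id[ngenes] = id
            row[ngenes] = $1 OFS ($4 - 1) OFS $5 OFS name OFS "." OFS $7
        }
        END {
            # Gene rows are buffered because CDS rows come after their gene row.
            for (g = 1; g <= ngenes; g++)
                if (gene_id[g] in coding) print row[g]
        }'
}
```

Usage would be something like gff3_coding_genes_to_bed interim_GRCm38.p6_top_level_2017-09-26.gff3.gz > coding_genes.bed. If you want one row per NM transcript instead, the same idea should apply with the mRNA rows in place of the gene rows, but check the attribute names in your file first.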