Question

Genome Music Evalutation Of Non-Coding / Intergenic Regions

1

Entering edit mode

10.7 years ago

DoubleD ▴ 130

I had asked this question in a message but should have posted it here; with answer from Cyriac.

Cyriac,

Thanks for your post about pathway file generation. We are having great luck with our data now. We have been using the "ensembl_67_cds_ncrna_and_splice_sites_hg19" as our region of interest for most of the MuSiC suite. We had sequenced 60x coverage of our tumors, whole genome. We have been trying to make an 'intergenic' ROI file in order to look at the rest of our data; two thoughts were to make a ROI that goes from base 1 to n of each chromosome, the other was to make a file that spanned the regions between the exons of the file listed above. Is this a reasonable investigation, or is MuSiC designed more to look at exome data?

Thank you, David

PM from Cyriac Kandoth:

Hi david. You can post your question to Biostar. I'm sure the answer will be of interest to other users as well.. In short: yes, it's a worthwhile investigation. We have previously used MuSiC's SMG test to find significantly altered non-coding regions... with mixed results, but interesting anyhow.

You can download intron loci in GTF format from Ensembl. Here is their latest from release 72. Look for the Human GTF at this link: http://useast.ensembl.org/info/data/ftp/index.html

It will need a bit of scripting to convert the GTF into a format that MuSiC likes.

~Cyriac

music genome intron • 2.9k views

ADD COMMENT • link updated 8.0 years ago by Biostar 20 • written 10.7 years ago by DoubleD ▴ 130

0

Entering edit mode

Thank you for the input and the link.

Two further questions - if we use the "ensembl_67_cds_ncrna_and_splice_sites_hg19" as our ROI file for calc-covg, bmr and smg, is there no point to using the noskip-non-coding option? It seems that if our ROI is the exons, then having it include non-coding mutations would give erroneous data since calc-covg didn't look at those regions, and BMR used those regions (the introns) to calculate the background mutation rate?

Second - if I am understanding correctly, the ensembl database uses the GENCODE dataset which is based on the protein-coding loci, and in the GFT file I cannot find any "intron_CNS" features; it seems to only contain exon and CDS sites. I looked in some earlier versions and couldn't find intron regions there either. Should I create a 'gene region' that spans the start to end, including all flanking, intron, exon and CDS sites, or are there defined intron regions somewhere that I'm missing?

Thank you, DD

Edit: Regarding the second question, I was able to get an output of introns from the UCSC Browser into BED format and I'm reconfiguring that into a ROI file for MuSiC. i guess the general question stands: does the GENCODE dataset currently include introns?

ADD REPLY • link 10.7 years ago by DoubleD ▴ 130

score 1 · Answer 1 · 2013-08-17

"ensembl_67_cds_ncrna_and_splice_sites_hg19" includes non-coding RNA genes (ncRNA) that can sometimes be the target of important variants. The "--skip-non-coding" option does not skip any ROIs, but it skips non-coding variants when parsing the MAF file, specifically those annotated as Intron, RNA, 3'Flank, 3'UTR, 5'Flank, 5'UTR, IGR, or Targeted_Region in column 9 of the MAF format. If you're familiar with Perl, check out this line of code for calc-bmr. Similarly, this line of code will exclude any variants that fall outside the ROIs you've defined.

Gencode includes plenty of non-coding genes as well: miRNA, lncRNA, siRNA, etc. A breakdown of Gencode v17 shows that only 20,330 out of 57,281 Gencode genes are protein-coding.

The GTF I pointed you to, doesn't explicitly define intronic loci. But you can extract the non-CDS bps of each gene using bedtools subtract. Below is some shell/Perl code to pull that off, starting with the GTF from Gencode v17.

Create 2 lists of genomic loci in UCSC BED format. One with the 5' and 3' bounds of each gene (either protein-coding or non-coding), and the other with all CDS loci.

wget ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz
gunzip gencode.v17.annotation.gtf.gz
grep -v ^# gencode.v17.annotation.gtf | perl -ne 'chomp; @c=split(/\t/); $c[0]=~s/^chr//; $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "gene")' > gene_bounds.bed
grep -v ^# gencode.v17.annotation.gtf | perl -ne 'chomp; @c=split(/\t/); $c[0]=~s/^chr//; $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS")' > all_cds_loci.bed

Use bedtools subtract aka subtractBed to create a new list of genomic loci, that excludes all_cds_loci from gene_bounds (correctly handling strandedness). At the same time, use some inline Perl to convert the UCSC 0-based start loci to 1-based loci, so that it can be used as a MuSiC ROI file.

subtractBed -s -a gene_bounds.bed -b all_cds_loci.bed | perl -ane '$F[1]++; print join("\t",@F[0..3])."\n"' > non_coding_roi_file_for_music.tsv