Creating an ROI File for the 1000 Genomes Project Reference Version 37
10.6 years ago
DoubleD ▴ 130

I am trying to make a comprehensive ROI file for MuSiC that matches the 1000 Genomes Project reference, since our BAMs and callers used that reference FASTA (human_g1k_v37.fasta). My question is how to make sure I have the right GTF file, one that defines the exon and CDS features, so that I can build an ROI file for the v37 reference.

On the 1000 genomes site I found a README for the gencode GTF in

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

where the FASTA reference was, but no associated GTF file. There were GTF annotation files in

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/functional_annotation/annotation_sets/

It seems that the file gencode7.coding.20120326.gtf.gz had only CDS and start/stop_codon features, but not the exon regions. Based on the GTF suggested by Cyriac in his excellent post about ROI file creation, "Best Reference Sequence For Music And "Bit_Test" Error", I kept looking for a GTF that defined both CDS and exon regions. The file gencode7_GRCh37.tgz had a Level 1&2 data file with such features, but I worry that with GRCh37 in the name, I'm going to run into problems with chromosome/gene coordinates (our BAMs and SNPs were called using the 1000 Genomes v37 reference). Is there a way of knowing whether this GTF file is the right annotation for human_g1k_v37.fasta?

Thank you, DD

PS: To make things even more confusing, the file "gencode.v7.annotation.level_1_2.gtf" from gencode7_GRCh37.tgz, which I used for ROI creation, names its chromosomes "chr1" and so on, which I understand to be the UCSC hg19 naming convention. So this file is on the 1000 Genomes server, has GRCh37 in the name, and uses hg19 chromosome naming conventions. Is this the right file?

music 1000genomes • 3.0k views
10.6 years ago

I wouldn't recommend using Gencode 7, because it's severely outdated. The latest Gencode release is 18, and the GTF can be found at gencodegenes.org. You can do something like this to create a BED file containing CDS/exon loci:

# Download and unpack the GENCODE 18 annotation
curl -LO ftp://ftp.sanger.ac.uk/pub/gencode/release_18/gencode.v18.annotation.gtf.gz
gunzip gencode.v18.annotation.gtf.gz
# Keep CDS/exon features; strip the "chr" prefix, make the start 0-based, and use gene_name as the BED name column
grep -v '^#' gencode.v18.annotation.gtf | perl -ne 'chomp; @c=split(/\t/); $c[0]=~s/^chr//; $c[3]--; $c[8]=~s/.*gene_name\s\"([^"]+)\".*/$1/; print join("\t",@c[0,3,4,8,5,6])."\n" if($c[2] eq "CDS" or $c[2] eq "exon")' > all_exon_loci.bed
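
Since the original concern was whether the annotation's chromosome names line up with human_g1k_v37.fasta, one possible sanity check is to list any contig names in the BED file that do not appear among the FASTA headers. The sketch below is only an illustration: the function name missing_contigs is hypothetical, and it assumes the FASTA contig name is the text between ">" and the first whitespace on each header line. As far as I know, the mitochondrial contig is one case to watch, because GENCODE's chrM becomes M after the prefix is stripped while the 1000 Genomes v37 FASTA calls it MT.

```shell
# Hypothetical helper: print each contig name used in a BED file that has
# no matching header in a FASTA reference (one name per line, no repeats).
# Usage: missing_contigs all_exon_loci.bed human_g1k_v37.fasta
missing_contigs() {
    bed="$1"
    fasta="$2"
    awk '
        # First file (FASTA): remember the name on every ">" header line
        NR == FNR {
            if ($0 ~ /^>/) {
                name = $1
                sub(/^>/, "", name)
                seen[name] = 1
            }
            next
        }
        # Second file (BED): report each unknown contig name once
        !($1 in seen) && !($1 in reported) {
            reported[$1] = 1
            print $1
        }
    ' "$fasta" "$bed"
}
```

If the function prints nothing, every contig named in the BED file exists in the reference. Any name it does print would need renaming or filtering before the loci are used as ROIs. Note that this only compares names, so it cannot confirm that the coordinates come from the same assembly.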