Request for karyotypically sorted Ensembl reference FASTA and its dbSNP VCF for GATK workflow
8.2 years ago
umn_bist ▴ 390

I have RNA-seq BAM files from which I need to call somatic variants. The problem is that GATK is very strict about how the BAM is formatted (karyotypically sorted, no 'chr' notation, read groups present).

Because my BAM files were aligned against an Ensembl reference, I keep running into validation errors. For example, I have to change the chromosome notation in the header, which I am hesitant to do after many failures (samtools view --> sed --> reheader), and I am also stuck on this error:

"Discordant contig lengths: read MT LN=16571, ref MT LN=16569" (note that I was using GATK's Homo sapiens hg19 reference)
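One way to see every contig that disagrees, rather than only the first one GATK reports, is to compare the `@SQ` lines of the BAM header with the reference's `.fai` index. The sketch below is only an illustration: it assumes the header text was dumped with `samtools view -H` and the index was built with `samtools faidx`, and the function names are made up for this example.

```python
def sq_lengths(sam_header_text):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    lengths = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def fai_lengths(fai_text):
    """Map contig name -> length from a .fai index (columns 1 and 2)."""
    lengths = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def discordant_contigs(bam_lengths, ref_lengths):
    """Return {contig: (bam_length, ref_length)} for shared contigs that differ."""
    return {
        name: (bam_lengths[name], ref_lengths[name])
        for name in bam_lengths
        if name in ref_lengths and bam_lengths[name] != ref_lengths[name]
    }
```

A length mismatch like the MT one above suggests the reads were aligned to a different sequence than the one in the reference, so renaming contigs in the header alone would not resolve it.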

Does anyone have an Ensembl reference and its corresponding dbSNP file that are usable with GATK? I can access the Ensembl FTP site, but I am quite lost as to which files are the right ones. Thank you very much for your help.

GATK RNA-Seq GrCh37 dbSNP ensembl • 2.9k views

See ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/README. You may want the "toplevel" version.


I downloaded Homo_sapiens.GRCh37.75.dna.toplevel.fa, but its contigs are sorted lexicographically rather than karyotypically.
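If the only problem is contig order, the FASTA records can be reordered. The following is a rough sketch, not a tested pipeline: it assumes "karyotypic" means 1 to 22, then X, Y, MT, then any remaining contigs by name, and it holds every record in memory, which is heavy for a full human genome. The function names are hypothetical.

```python
def karyotypic_key(name):
    """Sort key: numeric chromosomes first, then X, Y, MT, then the rest by name."""
    special = {"X": 23, "Y": 24, "MT": 25}
    if name.isdigit():
        return (0, int(name), "")
    if name in special:
        return (0, special[name], "")
    return (1, 0, name)


def reorder_fasta(lines):
    """Return FASTA lines with records reordered by karyotypic_key.

    The contig name is taken as the first whitespace-delimited token
    after '>' on each header line.
    """
    records = {}
    current = None
    for line in lines:
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = [line]
        elif current is not None:
            records[current].append(line)
    ordered = []
    for name in sorted(records, key=karyotypic_key):
        ordered.extend(records[name])
    return ordered
```

After reordering, the `.fai` index and the sequence dictionary would need to be regenerated, and the BAM contig order has to match the new reference as well.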


If you want something that works with GATK without any extra effort, you can download the GATK resource bundle:

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle

Your choice of reference(s) will be limited, though.
