Karyotypically Ordered Hg19
3
3
Entering edit mode
12.8 years ago

Hi,

I have a .bam file generated using the hg19 release of human genome downloaded from UCSC database. Now I'm trying to indentificate genomic variants using GATK UnifiedGenotyper but it returns an error cause my bam file is Lexicographically sorted. I've tried to use Picard ReorderSam function but I've noted that my reference genome file (hg19) is in the same lexicographic order.

Is there any way to convert my hg19 fasta file in the karyotypic order? Or is there any place where I can download a version of the reference human genome sorted in this way?

Thanks

hg reference gatk • 12k views
ADD COMMENT
5
Entering edit mode
10.5 years ago

The reference FASTAs downloadable from Ensembl are not karyotypically sorted, but that can be fixed with a little bit of Perl...

Download and unzip the Build37 (hg19) reference FASTA file from Ensembl release 72:

curl -LO ftp://ftp.ensembl.org/pub/release-72/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.72.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.72.dna.primary_assembly.fa.gz

Use this Perl one-liner that splits the FASTA file per chrom/contig into a temporary folder, and then concatenates them in karyotypic order:

perl -e 'use File::Temp qw/tempdir/; use IO::File; $d=tempdir; $fh; map{if(m/^\>(\S+)\s/){$fh=IO::File->new(">$d/$1.fa");} print $fh $_;}`cat Homo_sapiens.GRCh37.72.dna.primary_assembly.fa`; foreach $c(1..22,X,Y,MT){print `cat $d/$c.fa`}; print `cat $d/GL*`' > Homo_sapiens.GRCh37.72.dna.primary_assembly.reordered.fa
ADD COMMENT
0
Entering edit mode

Hi, thanks for your code. I have numerous scaffolds that begin with KI as well as GL. eg 26 KI270728.1 1872759 3091464192 27 KI270727.1 448248 3093561344 28 KI270442.1 392061 3094085632 29 KI270729.1 280839 3094609920 30 GL000225.1 211173 3095134208 31 KI270743.1 210658 3095396352 32 GL000008.2 209709 3095658496 33 GL000009.2 201709 3095920640 34 KI270747.1 198735 3096182784 35 KI270722.1 194050 3096444928 36 GL000194.1 191469 3096707072 37 KI270742.1 186739 3096969216 38 GL000205.2 185591 3097231360 39 GL000195.1 182896 3097493504 40 KI270736.1 181920 3097755648 41 KI270733.1 179772 3098017792 42 GL000224.1 179693 3098279936 43 GL000219.1 179198 3098542080

How do I concatenate these in the correct order? I tried print cat $d/KI*&GL*' > GRCh38.reordered.fa but got an error -


GL*: not found

ADD REPLY
0
Entering edit mode

Anything within backticks are simply shell commands that run and return with their standard output for Perl to parse. So in this case, you just add one more argument to the Unix cat command. Like this - cat $d/GL* $d/KI*

Btw, I don't recommend using in-line Perl in a production workflow. Besides, good bioinformatics tools should be able to handle FASTAs without karyotypically ordered chromosomes.

ADD REPLY
2
Entering edit mode
12.8 years ago

Broad makes one available from their FTP site:

ftp://gsapubftp-anonymous:@ftp.broadinstitute.org/bundle/5974/hg19/ucsc.hg19.fasta.gz

(Just hit enter if asked for a password)

They also have hg18 plus lots of other supporting data:

http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/5974/

ADD COMMENT
1
Entering edit mode
12.8 years ago

you can create your human genome by running the following shell:

for C in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
do
   curl "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${C}.fa.gz" |\
   gunzip -c >> hg19.fa
done
ADD COMMENT
3
Entering edit mode

Please, please include unlocalized and unplaced contigs. They are part of the primary human assembly.

ADD REPLY
3
Entering edit mode

How does GATK define 'karyotypic' sorting? Where should unlocalised contigs etc be placed? The question is also relevant when using other naming conventions e.g. Ensembl.

ADD REPLY
1
Entering edit mode

Can I please reask Travis's question? For some reason Ensembl releases their 'toplevel' genome file in a format that sorts the chromosomes, patches and unlocalised contigs in order of ascending length which, depending on the definition of 'karyotypic' sorting, may (or may not) be the opposite of what GATK requires.

ADD REPLY
0
Entering edit mode

open a new question, citing that one.

ADD REPLY

Login before adding your answer.

Traffic: 1516 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6