Question

Hg19 Versus Grch37

5

12.7 years ago

Sam ▴ 90

Can anyone explain why these two chromosome 1 files are different (the same goes for the other chromosomes)? I was under the impression that hg19 and GRCh37 are the same reference genome, but it looks like the hg19 version has a run of leading N placeholders, which would affect searching the two genomes by position.

ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr1.fa.gz

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

---> head chr1.fa

chr1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

-----> head hs_ref_GRCh37.p2_chr1.fa

>gi|224514618|ref|NT_077402.2| Homo sapiens chromosome 1 genomic contig, GRCh37.p2 reference primary assembly
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAA
CCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCT
AACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTA
ACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACC
CTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTA
ACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCG
CCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGAC
AACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGG

Thank you!

genome hg • 31k views

updated 10.5 years ago by Biostar 20 • written 12.7 years ago by Sam ▴ 90

0

I meant this file: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz

Looks like my question has been answered though by posting the "correct" link

12.7 years ago by Sam ▴ 90

score 5 · Answer 1 · 2011-08-26

5

12.7 years ago

lh3 33k

You must have run "head" on the wrong file. Please try again. hs_ref_GRCh37.p2_chr1.fa has lots of "N" bases at the beginning.

EDIT: GRC distributes the reference genome in two versions: one as contigs and the other as assembled chromosomes. The latter is in the "Assembled_chromosomes" directory. I do not know who is using the contigs, but nearly everyone I know uses the assembled chromosomes only.
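A quick way to tell which kind of file you have is to list its FASTA records and their lengths. This is only a sketch: `list_records` is a name made up here, and it assumes a contig file holds several records per chromosome while an assembled-chromosome file holds one long record.

```shell
# list_records (hypothetical helper): read FASTA on stdin and print
# "header<TAB>sequence length" for every record.
list_records() {
  awk '
    /^>/ {
      if (have) print name "\t" len    # flush the previous record
      name = substr($0, 2)             # header text without the ">"
      len = 0
      have = 1
      next
    }
    { len += length($0) }              # add up the sequence lines
    END { if (have) print name "\t" len }
  '
}
```

Run it as `list_records < hs_ref_GRCh37.p2_chr1.fa`; many short records would point to the contig version.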

12.7 years ago by lh3 33k

0

ahh you're right. However, here was the file I was doing the "head" command on: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/hs_ref_GRCh37.p2_chr1.fa.gz

That file is found in a different directory, same name though?

12.7 years ago by Sam ▴ 90

0

Thanks btw for your help

12.7 years ago by Sam ▴ 90

0

That file contains contigs, not chromosomes. Different things.

12.7 years ago by lh3 33k

score 3 · Answer 2 · 2011-08-26

The files are not different - they contain identical sequence and both begin with a run of 'N' characters.

For future reference, here's a quick way to count the bases:

# chr1
tail -n+2 chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 32485284
N 23970000
C 23064132
T 32559153
G 23070958
a 33085607
c 23960280
g 23945604
t 33109603

# hs_ref_GRCh37.p2_chr1
tail -n+2 hs_ref_GRCh37.p2_chr1.fa | awk '{ for ( i=1; i<=length; i++ ) arr[substr($0, i, 1)]++ }END{ for ( i in arr ) { print i, arr[i] } }'
A 65570891
N 23970000
C 47024412
G 47016562
T 65668756

If you sum the upper- and lowercase counts for chr1.fa (the UCSC file soft-masks repeats in lowercase), you'll see that both files contain exactly the same count for each base, and the total length is 249,250,621.
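To skip the manual summing, the same count can fold case before tallying, so the two files can be compared directly. A minimal sketch along the lines of the awk one-liner above; `count_bases` is a name chosen here for illustration, and it skips header lines by matching `>` rather than dropping the first line.

```shell
# count_bases (hypothetical helper): read FASTA on stdin and print one
# "BASE COUNT" line per base, with lowercase folded into uppercase.
count_bases() {
  awk '
    /^>/ { next }                      # skip FASTA header lines
    {
      line = toupper($0)               # fold soft-masked bases
      n = length(line)
      for (i = 1; i <= n; i++) count[substr(line, i, 1)]++
    }
    END { for (b in count) print b, count[b] }
  ' | sort                             # awk array order is unspecified
}
```

For example, `count_bases < chr1.fa > a.txt; count_bases < hs_ref_GRCh37.p2_chr1.fa > b.txt; diff a.txt b.txt` should print nothing if the base composition matches.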

Jeremy Leipzig · Answer 3 · 2011-08-26

0

12.7 years ago

Allpowerde ★ 1.3k

There are some differences here and there (e.g. the "chr" prefix is omitted and the annotations seem to be 0-indexed vs. 1-indexed) but nothing as drastic as you showed. It is still a major pain though.
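As an illustration of those two differences, here is a rough sketch that turns UCSC-style intervals into b37-style region strings. It assumes the input is tab-separated BED with 0-based, half-open coordinates and that only the leading "chr" needs stripping; the mitochondrial chromosome is not handled, and `bed_to_region` is a made-up name.

```shell
# bed_to_region (hypothetical helper): read BED lines (chrom, start, end)
# on stdin and print 1-based, inclusive regions without the "chr" prefix.
bed_to_region() {
  awk 'BEGIN { FS = "\t" }
    {
      chrom = $1
      sub(/^chr/, "", chrom)               # chr1 -> 1, chrX -> X
      print chrom ":" ($2 + 1) "-" $3      # 0-based start -> 1-based start
    }'
}
```

With this, the BED line `chr1<TAB>0<TAB>10000` becomes `1:1-10000`: the start shifts by one while the end stays the same, because a half-open end already equals the last included 1-based position.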

GATK has clearly taken sides by making b37 its primary reference genome, but PLINK and GWAS studies are hg19-based. So I guess we will have to deal with this issue a little bit longer.

updated 11.7 years ago by Jeremy Leipzig 22k • written 12.7 years ago by Allpowerde ★ 1.3k

0

thank you for your help

12.7 years ago by Sam ▴ 90