1000 Genomes Human Reference Assembly File With Non-ACTGN Characters In Chr 3, Why?
12.9 years ago

I am using the 1000 Genomes human reference assembly file:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

It works fine with bwa, but not with other programs that expect only the letters A, C, T, G and N.

It seems that chr3 contains an M on one line and two Rs on another:

 $  perl -lnE 'print if /[^ACTGN]/' human_g1k_v37.fasta
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
CGCTACATAGCTGMCTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG
CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
[...]

Assuming these are IUPAC ambiguity codes, why are they in the reference at all (isn't GRCh37 haploid)? Why only there? Why not use an N instead? Is this on purpose?
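
Note that the one-liner above also prints every header line, because headers contain characters other than A, C, T, G and N. To report exact coordinates instead of whole lines, something like the following could be used. This is only a minimal sketch: the function name `find_non_acgtn` and the demo record are made up for illustration, and it assumes lowercase (soft-masked) bases should not be flagged, so it uppercases before comparing.

```python
import io

def find_non_acgtn(handle, allowed="ACGTN"):
    """Yield (sequence name, 1-based position, character) for every
    sequence character that is not in `allowed` (case-insensitive)."""
    allowed_set = set(allowed)
    name = None
    pos = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # New record: take the first word of the header as the name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            pos = 0
            continue
        for ch in line:
            pos += 1
            if ch.upper() not in allowed_set:
                yield name, pos, ch

# Tiny hypothetical record, not the real chromosome 3
demo = io.StringIO(">3 dna:chromosome\nCGCTACATAGCTGMCTT\nCCRRGCTT\n")
hits = list(find_non_acgtn(demo))
for hit in hits:
    print(*hit, sep="\t")
```

On the real file, the handle could be opened with `gzip.open("human_g1k_v37.fasta.gz", "rt")` so the archive does not need to be decompressed first.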

genome human • 3.9k views
12.9 years ago

The reference sequence for hg19_dna range=chr3:60830521-60830580 5'pad=0 3'pad=0 is:

CGCTACATAGCTG*N*CTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG

your sequence:

CGCTACATAGCTG*M*CTTATTATTCGTGGTCCCCTATGACCCCCTGATCATTTTCCCTGAG

The 'M' (A or C: your sample was heterozygous) is an 'N' in the reference sequence.

Second, the reference sequence:

CCNNGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAA

your sequence:

CCRRGCTTGGTTCTAACAATGAATTTAATAAGAATTGTATTTAATCAATGTTTAAATATA

Again, the reference sequence contains two 'N's there.
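
For context, M and R are IUPAC nucleotide ambiguity codes: M stands for A or C, R for A or G, and N for any base. The lookup below is a small sketch of that table; the helper names `expand` and `compatible` are illustrative and not taken from any library.

```python
# IUPAC nucleotide codes mapped to the bases they can represent
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(code):
    """Return the bases that a single IUPAC code may stand for."""
    return IUPAC[code.upper()]

def compatible(code_a, code_b):
    """True if the two codes share at least one possible base."""
    return bool(set(expand(code_a)) & set(expand(code_b)))

print(expand("M"))           # AC
print(expand("R"))           # AG
print(compatible("N", "M"))  # True
```

A tool that only accepts A, C, T, G and N would need such codes replaced (for example by N) before it can read the file.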


I have accepted your answer to reward the time you spent answering. I now have confirmation from 1000 Genomes researchers that this seems to be an error propagated from earlier genome reference sequences. They will fix the sequence on the FTP site, but this is tricky because the BAM file headers contain the MD5 checksums of the reference sequences.
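
To illustrate why the checksum makes the fix tricky: the SAM specification defines the `M5` tag of an `@SQ` header line as the MD5 of the sequence after dropping characters outside the printable range 33 to 126 and uppercasing the rest. The sketch below follows that description; the function name `sam_m5` and the short example sequence are assumptions for illustration, not the real chromosome.

```python
import hashlib

def sam_m5(sequence):
    """MD5 of a sequence as described for the SAM @SQ M5 tag:
    keep only characters in the range '!' (33) to '~' (126),
    uppercase them, and return the lowercase hex digest."""
    cleaned = "".join(c for c in sequence if 33 <= ord(c) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

# Hypothetical fragment: changing a single M to N changes the digest
original = "CGCTACATAGCTGMCTTATTATT"
patched = original.replace("M", "N")
print(sam_m5(original))
print(sam_m5(patched))
print(sam_m5(original) != sam_m5(patched))
```

Any change to a base, even M to N, gives a different digest, so existing BAM headers would no longer match the corrected reference.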


Thanks Pierre, but the problem is that this sequence is supposed to be the HUMAN GENOME REFERENCE prepared and used for the 1kg. So why doesn't it have the original Ns?


It is clear that ambiguities remain, but (I assume) always as an N in the reference, which is haploid.


Because there was strong ambiguity when the reference was sequenced. Some ambiguities remain.
