Question

Ngs Challenge: Can You Give An Insight About This Imaginary Genetic Disease ?

21

Entering edit mode

12.3 years ago

Pierre Lindenbaum 161k

I'm not sure Biostar is the right place to post this question, nevertheless I thought it would be interesting and fun to submit the challenge I proposed to my students last week for a course about NGS.

The following URL links to a zip (53Mo) containing 3 pairs of FASTQ files: http://dl.dropbox.com/u/18871518/data.zip

$ unzip -t data.zip 
Archive:  data.zip
    testing: father.dir/genome_1.fq.gz   OK
    testing: father.dir/genome_2.fq.gz   OK
    testing: mother.dir/genome_1.fq.gz   OK
    testing: mother.dir/genome_2.fq.gz   OK
    testing: child.dir/genome_1.fq.gz   OK
    testing: child.dir/genome_2.fq.gz   OK
No errors detected in compressed data of data.zip

The terms of the problem:

You've been sent the FASTQs of the exomes of 3 related humans: A father, a mother and their child. They have been sequenced for the study of a autosomal monogenic recessive genetic disease. Only the child is affected by the disease. Can you give me an insight about the origin of this disease?

Hint n°1: it's on the chr22

Hint n°2: the child has never been affected by a gastroenteritis (Yep, that's the phenotype ! :-) )

PS: My students were not very comfortable with the command line and unix so they haven't been able to find the final answer after 3H30 but I hope my fastqs are ok...

The most upvoted/elegant solution wins on saturday 24th.

alt text

Image via WP Commons

next-gen-sequencing • 4.6k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 12.3 years ago by Pierre Lindenbaum 161k

2

Entering edit mode

I can understand that your students were a bit uncomfortable, I would think this is not quite trivial, even though it is quite obvious what you have to look for. What I find confusing is hint No 2: the child has never been affected by a gastroenteritis (Yep, that's the phenotype ! :-) ) vs. "Only the child is affected by the disease." What phenotype are you talking about, so there are at least two phenotypes? what do you mean by "that's the phenotype", implying there is only a single phenotype in play?

ADD REPLY • link 12.3 years ago by Michael 54k

1

Entering edit mode

This is a fun problem, but I'd point out that normal people are quite busy the week before Christmas :)

ADD REPLY • link 12.3 years ago by Neilfws 49k

1

Entering edit mode

So... what are you saying about my normality?

ADD REPLY • link 12.3 years ago by Brad Chapman 9.7k

0

Entering edit mode

Nice assignment. Sounds fun.

ADD REPLY • link 12.3 years ago by Damian Kao 16k

0

Entering edit mode

Can we have a bed file?

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

Entering edit mode

How long are these reads...? BWA sampe is choking

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

Entering edit mode

erm... is this real data?

ADD REPLY • link 12.3 years ago by Russh ★ 1.2k

0

Entering edit mode

Are there reads from other chromosomes? Little more info please.

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

Entering edit mode

the quality scores are all the same. It's probably not real data :)

ADD REPLY • link 12.3 years ago by Damian Kao 16k

0

Entering edit mode

@zev.kronenberg wy do you want a BED ? what kind of error did you get with BWA

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

@russH no , the files have been generated using samtools wgsim

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

@zev.kronenberg you can limit your search to the human chr22 hg19

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

@DK: not real data: the reads have been generated with samtools wgsim, I've inserted some mutations in the fasta files of the 3 individuals.

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

@Pierre_Lindenbaum

1) I guess I don't need a bed if its fake data

2)

[zkronenb@browning child.dir]$ ~/tools/bwa-0.5.9/bwa sampe /data/ucsc/hg19/fasta/chr22.fa genome_1.sai genome_2.sai genome_1.fq genome_2.fq [bam_header_read] invalid BAM binary header (this is not a BAM file). [bam_header_read] invalid BAM binary header (this is not a BAM file). @SQ SN:chr22 LN:51304566 @PG ID:bwa PN:bwa VN:0.5.9-r16 [bwa_read_seq] the maximum barcode length is 15.

ADD REPLY • link 12.3 years ago by Zev.Kronenberg 12k

0

Entering edit mode

@Michael: Well, they were not alone. I was here to tell and explain them the commands to run. But it was not simple for them to simply move a file to another directory, using the tab-completion, etc....

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

score 10 · Answer 1 · 2011-12-19

Pierre -- this is a brilliant idea. Having these sort of real life worked examples is a great learning experience for students. It's a nice microcosm of bioinformatics work:

Cleaning up input data

Some of the supplied input data had non-unique names, which would make aligners and other downstream analysis unhappy. Specifically the mother and father data was all named 'chr22.' I used the fastx_renamer from the FASTX toolkit to clean it up. Here is the renamed data:

Cleaned fastq input data

Applying open source tools and pipelines

I performed alignment and variant calling with bwa and GATK using my open-source variant calling pipeline (described in more detail in this blog post). With this YAML configuration file and the fastq directory, you run the pipeline using:

distributed_nextgen_pipeline.py global_config.yaml fastq run_info.yaml
                                -n <num cores>

This produces annotated VCF file with predicted variant effects:

Variant call VCF files

Writing custom code to analyze the data

Given variant calls for mother, father and child I wanted to identify variants with these characteristics:

A reliable call (not filtered by GATK filters)
Caused a potentially problematic change in a coding region
Was heterozygous in the father/mother (recessive carriers) and homozygous in the child

I wrote a small Clojure program using the GATK VCF library to identify these. The code is here.

Following up with the biologists who generated the data

This gives me one candidate variation in a gene with some relationship to the phenotype. I'll keep my guess quiet until later on to avoid ruining the experience for others who want to dig into it.

But I'm also not entirely positive about my results given the problem description, so I've got a couple of questions for Pierre:

Which family members have the phenotype? In the variation I found both the father and child are homozygous, so I'd expect them both to show the phenotype.
Are we expecting an obvious change? I identified a non synonymous change in a gene with an OMIM relationship related to the phenotype, but didn't follow up deeper to see how the mutation disrupts protein function. So if this were real data I'd want to dig lots further to be more sure.
Could you release the process you used to generate the input data? It would be awesome to be able to do this for other genes so you could use this in teaching environments without students being able to get the answer via internet searching.

Edit:

Pierre -- thanks for the updates and the gene to look at. I'm not able to reconstruct the expected phenotypes even knowing where to look. Here are the variant calls in that region grepped out:

https://gist.github.com/1507630

It looks like the father is homozygous for a nonsense variation with a premature stop codon, the mother is heterozygous for a missense amino acid change and the child has two heterozygous variations at those positions. I'd expect the father to have the issue and the child to have the issue if you knew the variations were on different alleles. I didn't apply haplotype phasing but they are close enough they could be on the same read so read-based phasing could pick this up.

Edit 2:

I applied GATK's Read-backed haplotype phasing on these and updated the VCF files and gist. The child looks exactly right: the two errors are phased on alternate haplotypes so you'd expect no fully functional copies.

So the only confusing part with respect to your intention is the father's genotype. GATK genotyping identifies that position as homozygous with the non-sense allele. The pileup does have some evidence for both alleles, but weighs towards more As (the non-sense allele) versus Ts (the working allele):

chr22    41742029    N    23    TATTTAATtAAAAAAAaAAAAAt    23000330033333331333332

That's a tricky call. I can see why GATK predicted it that way but also see an argument for being heterozygous. So another useful bioinformatics lesson: digging deeply into your results can reveal a lot of hidden complexity.

score 2 · Answer 2 · 2011-12-29

to complete Brad's answer. Here is a part of my solution. Here I have used samtools mpileup rather than the GATK and some of my (experimental) tools.

Ok, first here is the Makefile that was used to generate the data for my students (the coverage is low as I didn't want to generate some large files and I forgot to discard the duplicates.

sam_dir=/usr/local/package/samtools-0.1.18
bwa_dir=/usr/local/package/bwa-0.6.0
bwa_bin=${bwa_dir}/bwa
sam_bin=${sam_dir}/samtools
KROM=chr22
VARKIT=/home/lindenb/src/variationtoolkit/bin
MAKEFILE=-f 20111205.mk
MPILEUP=## -B -e 10

#http://theory.uwinnipeg.ca/localfiles/infofiles/make/make_36.html
.SECONDARY:

%.txt.gz : %.txt
    gzip -f --best $<
%.fq.gz : %.fq
    gzip -f --best $<

%_1.fq %_2.fq:%.fa
    ${sam_dir}/misc/wgsim -N 500000 -r 0  -h $<  $(subst .fa,_1.fq,$<) $(subst .fa,_2.fq,$<)
    sed "s/^@.*/@${KROM}/"  $(subst .fa,_1.fq,$<) >  $(subst .fa,.tmp.fq,$<)
    mv $(subst .fa,.tmp.fq,$<) $(subst .fa,_1.fq,$<)
    sed "s/^@.*/@${KROM}/"  $(subst .fa,_2.fq,$<) >  $(subst .fa,.tmp.fq,$<)
    mv $(subst .fa,.tmp.fq,$<) $(subst .fa,_2.fq,$<)

%.sai:%.fq.gz
    ${bwa_bin} aln ${KROM}db $< > $@

%.fai:%.fa
    ${sam_bin} faidx $<

%.sam:%_1.sai %_2.sai
    echo $<
    ${bwa_bin} sampe -a 700 ${KROM}db  \
        $(subst .sam,_1.sai,$@) \
        $(subst .sam,_2.sai,$@) \
        $(subst .sam,_1.fq.gz,$@) \
        $(subst .sam,_2.fq.gz,$@) > $@

%.bam:%.sam
    ${sam_bin} view -b -T ${KROM}.fa  $< > $@.tmp
    ${sam_bin} sort $@.tmp $@.tmp.sorted
    mv $@.tmp.sorted.bam $@
    ${sam_bin} index $@

%.bcf:%.bam
    ${sam_bin} mpileup ${MPILEUP} -uf ${KROM}.fa $(subst .bcf,.bam,$@) |\
        ${sam_dir}/bcftools/bcftools view -bvcg - > $@
%.vcf:%.bcf
    ${sam_dir}/bcftools/bcftools view $< > $@

${KROM}.fa:
    ${sam_dir}/samtools faidx /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa ${KROM} > ${KROM}.fa
    ${sam_dir}/samtools faidx ${KROM}.fa
    ${bwa_bin} index -a bwtsw -p ${KROM}db ${KROM}.fa

snps.txt:
    mysql -N -u anonymous -D hg19 -e 'select distinct chrom,chromStart+1,"A","A",func from snp132 where chrom="chr22" and chromStart > 21304567 and chromStart < 31742029 and func="coding-synon" order by avHetSE limit 10' | head > $@

ignore.txt:
    echo -e "chr22\t41696567\t41757151" > $@

capture.txt:
    mysql -N -u anonymous -D hg19 -e "select chrom,exonCount,exonStarts,exonEnds from knownGene where chrom='${KROM}' and cdsStart<cdsEnd and="" txStart=""> 21304567 order by chrom " |\
    awk -F '    ' '{n=int($$2); split($$3,B,"[,]");split($$4,E,"[,]"); for(i=1;i<=n;++i) printf("%s\t%d\t%d\n",$$1,int(B[i])-350,int(E[i])+350);}' |\
    ${VARKIT}/bedmerge > $@

father.dir: capture.txt ${KROM}.fa snps.txt  ignore.txt
    ${VARKIT}/genomesim -o $@.tar -r 0.0001 -i ignore.txt -f capture.txt -u snps.txt -m chr22 41742029 A . ${KROM}.fa
    tar xvf  $@.tar
    rm $@.tar
    cat  $@/homologous1.fa $@/homologous2.fa > $@/genome.fa
    $(MAKE) $(MAKEFILE) $@/genome.vcf
    grep 41742029 $@/genome.vcf

mother.dir: capture.txt ${KROM}.fa snps.txt  ignore.txt
    ${VARKIT}/genomesim -o $@.tar -r 0.0001  -i ignore.txt -f capture.txt -u snps.txt  -m chr22 41742047 G . ${KROM}.fa
    tar xvf  $@.tar
    rm $@.tar
    cat  $@/homologous1.fa $@/homologous2.fa > $@/genome.fa
    $(MAKE) $(MAKEFILE) $@/genome.vcf
    grep 41742047 $@/genome.vcf

child.dir: father.dir mother.dir
    mkdir $@
    cat father.dir/homologous1.fa mother.dir/homologous1.fa > $@/genome.fa
    ${sam_dir}/misc/wgsim -N 500000 -r 0.00001 -h $@/genome.fa  $@/genome_1.fq $@/genome_2.fq >  $@/wgsim.tsv
    $(MAKE) $(MAKEFILE) $@/genome.vcf
    -grep 41742029 $@/genome.vcf
    -grep 41742047 $@/genome.vcf

basic variants effect prediction with ucsc knownGenes:

$ cat father.dir/genome.vcf  mother.dir/genome.vcf child.dir/genome.vcf  |\
cut -d '      ' -f 1,2,3,4,5 | grep -v "##"  |\
${VARKIT}/prediction --host localhost --port 3316 --user anonymous -f /path/to/hg19.fa |\
cut -d '  ' -f 1-20 | sort | uniq > predictions.txt

read each VCF, append the name of the SAMPLE, extract the field GT and discard the variation if it is homozygous, create a new key (chrom_pos_ref_alt):

echo "MOTHER\tmother.dir/genome.vcf\nFATHER\tfather.dir/genome.vcf\nCHILD\tchild.dir/genome.vcf" > vcf2sample.txt

${VARKIT}/scanvcf vcf2sample.txt | tr -d '\r' |\
${VARKIT}/extractformat -t GT |\
awk '(!($$11=="FATHER" && $$12=="1/1"))' |\
awk '(!($$11=="MOTHER" && $$12=="1/1"))' |\
awk '(!($$11=="CHILD" && $$12=="1/1"))' |\
awk '{printf("%s_%s_%s_%s\t%s\n",$$1,$$2,$$4,$$5,$$0);}' |\
sort -t '   ' -k1,1 > tmp1.vcf

create a new key (chrom_pos_ref_alt) for the predictions, keep the non synonymous and the stops gained/deleted:

awk '{printf("%s_%s_%s_%s\t%s\n",$$1,$$2,$$4,$$5,$$0);}' predictions.txt |\
tr -d '\r' | sort -t '  ' -k1,1 > tmp2.vcf

join the two files with the key chrom_pos_ref_alt, sort, group by gene, keep the gene having at least one mutation in the father/mother/child, keep the ucsc name, extract the gene symbols using ucsc kgXref, search the ncbi for those genes and get some information about those genes:

join -t '   ' -1 1 -2 1 tmp1.vcf tmp2.vcf | egrep '(#CHROM|NON_SYNO|STOP)' | cut -d '   ' -f2- |\
sort -t '   ' -k1,1 -k2,2 -k4,4 -k5,5 -k11 |\
${VARKIT}/groupbygene --gene 18  --sample 11 |\
awk '($$1=="GENE" || ($$7!="0" && $$8!="0" && $$9!="0"))' |\
cut -d '    ' -f 1|\
${VARKIT}/mysqlquery --host genome-mysql.cse.ucsc.edu  --user genome -e 'select geneSymbol from kgXref where kgId="$$1"' |\
sort | uniq |\
awk '($$2!=".")' |\
${VARKIT}/ncbiesearch -D gene -q '"$$2"[GENE] "Homo Sapiens"[ORGN]' |\
${VARKIT}/ncbiefetch -D gene  -c 3

Result:

uc002zww.2  BCR 613 BCR breakpoint cluster region       HGNC=1014|Ensembl=ENSG00000186716|HPRD=01044|MIM=151410|Vega=OTTHUMG00000150655 A reciprocal translocation between chromosomes 22 and 9 produces the Philadelphia chromosome, which is often found in patients with chronic myelogenous leukemia. The chromosome 22 breakpoint for this translocation is located within the BCR gene. The translocation produces a fusion protein which is encoded by sequence from both BCR and ABL, the gene at the chromosome 9 breakpoint. Although the BCR-ABL fusion protein has been extensively studied, the function of the normal BCR gene product is not clear. The protein has serine/threonine kinase activity and is a GTPase-activating protein for p21rac. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
uc002zwx.2  BCR 613 BCR breakpoint cluster region       HGNC=1014|Ensembl=ENSG00000186716|HPRD=01044|MIM=151410|Vega=OTTHUMG00000150655 A reciprocal translocation between chromosomes 22 and 9 produces the Philadelphia chromosome, which is often found in patients with chronic myelogenous leukemia. The chromosome 22 breakpoint for this translocation is located within the BCR gene. The translocation produces a fusion protein which is encoded by sequence from both BCR and ABL, the gene at the chromosome 9 breakpoint. Although the BCR-ABL fusion protein has been extensively studied, the function of the normal BCR gene product is not clear. The protein has serine/threonine kinase activity and is a GTPase-activating protein for p21rac. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
uc003azw.2  ZC3H7B  23264   ZC3H7B  zinc finger CCCH-type containing 7B     HGNC=30869|Ensembl=ENSG00000100403|HPRD=06711|Vega=OTTHUMG00000150969   This gene encodes a protein that contains a tetratricopeptide repeat domain. The encoded protein also interacts with the rotavirus non-structural protein NSP3. [provided by RefSeq, Jul 2008]
uc003biz.3  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292    
uc003bja.2  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292    
uc010haj.2  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292    
uc010hak.2  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292    
uc010hal.2  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292    
uc010ham.2  CRELD2  79174   CRELD2  cysteine-rich with EGF-like domains 2       HGNC=28150|Ensembl=ENSG00000184164|HPRD=08455|MIM=607171|Vega=OTTHUMG00000150292

pubmed tells us (Rotavirus / gastroenteritis) that ZC3H7B is a good candidate and it is confirmed by looking at the variations chr22:41742029 and chr22:41742047 . We can find the domain of the protein affected by those mutations using uniprot (here, a zinc finger) :

$ grep uc003azw.2 predictions.txt |\
 ${VARKIT}/mysqlquery -q 'select spId from kgXref where kgId="$6"' |\
 ${VARKIT}/uniprot -p 14 -s 21

chr22   41742029    .   T   A   uc003azw.2  +   41697566    41756150    41716664    41753433    EXON|EXON_STOP_GAINED   1481    494 Exon 14 .   TGT TGA C   *   Q9UGR2-2    1   993 chain   .   Zinc finger CCCH domain-containing protein 7B   .   .
chr22   41742047    .   C   G   uc003azw.2  +   41697566    41756150    41716664    41753433    EXON|EXON_CODING_NON_SYNONYMOUS 1499    500 Exon 14 .   TGC TGG C   W   Q9UGR2-2    1   993 chain   .   Zinc finger CCCH domain-containing protein 7B   .   .
chr22   41742047    .   C   G   uc003azw.2  +   41697566    41756150    41716664    41753433    EXON|EXON_CODING_NON_SYNONYMOUS 1499    500 Exon 14 .   TGC TGG C   W   Q9UGR2-2    500 524 zinc finger region  .   C3H1-type 1 .   .