Question

Illumina NGS Test Data For SNP/Indel Calling And Association Study

1

12.9 years ago

Travis ★ 2.8k

Hi all,

I am currently trying to build a pipeline for Illumina NGS sequence alignment, SNP/indel calling and association testing across multiple samples.

Can anyone recommend a data set for testing a pipeline like this, right through to the stage of calling associated mutations?

Alternatively, is there an existing means of generating such a data set?

Any tips would be greatly appreciated.

Thanks in advance!

next-gen sequencing snp association • 3.6k views

ADD COMMENT • link updated 11.0 years ago by marcodpc ▴ 60 • written 12.9 years ago by Travis ★ 2.8k

score 3 · Answer 1 · 2011-06-02

3

12.9 years ago

User 59 13k

Why not use the publicly available 1000 Genomes data (or a subset thereof)?

http://www.1000genomes.org/data

They have sequence data from various platforms, including Illumina. You can work directly on the sequence data, or pick the pre-aligned BAM files to work with.

If you want to speed up throughput for pipeline development, just subset the data to a single chromosome and the reads mapped to it.

EDIT: OK, I didn't see the part about association testing. I'm not sure what the requirements would be for that in terms of sample numbers/size, but you might still be able to leverage the existing 1000 Genomes datasets.
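
For anyone following the single-chromosome suggestion, here is a minimal Python sketch of the idea, working on uncompressed SAM text. The function name `subset_sam` and the toy records are illustrative assumptions, not part of any tool. On real, indexed BAM files a region query such as `samtools view -b in.bam 20 > chr20.bam` is the usual route.

```python
def subset_sam(lines, chrom):
    """Yield SAM lines restricted to one reference sequence.

    Header lines are kept, except @SQ entries for other sequences.
    Alignment records are kept only if RNAME (column 3) equals `chrom`.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                # Header fields look like TAG:VALUE, separated by tabs.
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
                if tags.get("SN") != chrom:
                    continue
            yield line
        elif line.split("\t")[2] == chrom:
            yield line


# Toy input (hypothetical records, for illustration only).
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:20\tLN:63025520",
    "@SQ\tSN:21\tLN:48129895",
    "r1\t0\t20\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\t21\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
for record in subset_sam(sam, "20"):
    print(record)
```

Note that this sketch does not touch the mate fields, so a paired read whose mate maps to another chromosome will still name that chromosome in RNEXT. That is usually acceptable for pipeline testing, but some tools may complain.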

ADD COMMENT • link 12.9 years ago by User 59 13k

0

A good suggestion. At its very simplest, I just want to run through a workflow with a few samples (say 2 phenotypes with 5 members per group and known causal mutations present). I know this isn't scientifically sound; it's simply to get a feel for the workflow, the filtering of unassociated mutations, etc.
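
As a rough illustration of the final step in a toy setup like this (2 groups of 5, one planted causal variant), the sketch below runs a two-sided Fisher's exact test on carrier status per variant. Fisher's exact test is my own choice for tiny groups, not something prescribed in the thread, and the variant names and carrier flags are made up.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def association_scan(carriers, is_case):
    """Map each variant to its p-value, given per-sample carrier flags."""
    results = {}
    for variant, flags in carriers.items():
        a = sum(1 for f, case in zip(flags, is_case) if f and case)
        b = sum(1 for f, case in zip(flags, is_case) if not f and case)
        c = sum(1 for f, case in zip(flags, is_case) if f and not case)
        d = sum(1 for f, case in zip(flags, is_case) if not f and not case)
        results[variant] = fisher_exact_two_sided(a, b, c, d)
    return results


# Hypothetical data: 5 cases followed by 5 controls.
is_case = [True] * 5 + [False] * 5
carriers = {
    "planted_causal": [True] * 5 + [False] * 5,
    "background": [True, True, True, False, False,
                   True, True, False, False, False],
}
for variant, p in sorted(association_scan(carriers, is_case).items(),
                         key=lambda kv: kv[1]):
    print(variant, round(p, 4))
```

With only 5 samples per group, the smallest p-value this test can produce is 2/252 (about 0.008), so a run like this only checks that the plumbing works and that the planted variant ranks first.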

ADD REPLY • link 12.9 years ago by Travis ★ 2.8k

score 0 · Answer 2 · 2011-06-02

Looking at papers in the area, I guess one approach would be to download or generate a few sets of reads and then artificially introduce a known causal mutation or two into the appropriate groups. It is an over-simplification, but I believe it is still useful for testing. Exactly how to introduce the mutation into the data sets is another question entirely, I suppose!
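
One way to approach the "how to introduce the mutation" part is to simulate reads from a reference and substitute the alternate base in reads that cover the chosen position. The sketch below does this on plain strings, with no sequencing errors or base qualities. The function `simulate_reads`, its parameters and the toy reference are all hypothetical, and dedicated read simulators will be far more realistic.

```python
import random


def simulate_reads(reference, n_reads, read_len,
                   snp_pos=None, alt_base=None, alt_fraction=0.5, seed=0):
    """Sample error-free reads; optionally plant a SNP at 0-based snp_pos.

    Returns a list of (start, sequence) tuples. A read covering snp_pos
    carries alt_base with probability alt_fraction (0.5 mimics a
    heterozygous site, 1.0 a homozygous alternate site).
    """
    if snp_pos is not None and alt_base is None:
        raise ValueError("alt_base is required when snp_pos is given")
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(reference) - read_len + 1)
        seq = list(reference[start:start + read_len])
        covers = snp_pos is not None and start <= snp_pos < start + read_len
        if covers and rng.random() < alt_fraction:
            seq[snp_pos - start] = alt_base
        reads.append((start, "".join(seq)))
    return reads


# Toy 60 bp reference; position 30 is 'G' and is mutated to 'T' in cases.
reference = "ACGT" * 15
case_reads = simulate_reads(reference, 200, 20, snp_pos=30,
                            alt_base="T", alt_fraction=1.0, seed=1)
control_reads = simulate_reads(reference, 200, 20, seed=2)
```

Running this once per sample, with the SNP planted only in the case group, gives a data set where the expected answer is known in advance.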

score 0 · Answer 3 · 2013-05-03

0

11.0 years ago

marcodpc ▴ 60

If I use 1000 Genomes data, can I simulate a pooled sample in some way and generate SNPs and indels in it?

ADD COMMENT • link 11.0 years ago by marcodpc ▴ 60