SNP calling using reference with no raw data
1
0
Entering edit mode
9.3 years ago
Yahan ▴ 400

I have to do SNP calling on 40 or so samples, a few of which originate from public sources. For these, the raw data is not always available.

Therefore I thought of building a dummy paired-end FASTQ dataset by chopping the reference into pieces with a sliding-window approach, to add some coverage.
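To make the window idea concrete, here is a minimal Python sketch of how such dummy pairs could be cut from a sequence. The function names, the fragment length, read length, step size and the constant quality character are all assumptions chosen for illustration; nothing here comes from a specific tool.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def window_pairs(name, seq, fragment_len=300, read_len=100, step=50, qual_char="I"):
    """Yield (read1, read2) FASTQ records from fixed windows sliding along seq.

    Each window acts as one pseudo-fragment: read 1 is its first read_len
    bases, read 2 is the reverse complement of its last read_len bases.
    All parameter defaults are illustrative assumptions.
    """
    qual = qual_char * read_len  # constant dummy base qualities
    for start in range(0, len(seq) - fragment_len + 1, step):
        frag = seq[start:start + fragment_len]
        read1 = frag[:read_len]
        read2 = reverse_complement(frag[-read_len:])
        rid = f"{name}_{start + 1}_{start + fragment_len}"
        yield (f"@{rid}/1\n{read1}\n+\n{qual}\n",
               f"@{rid}/2\n{read2}\n+\n{qual}\n")
```

Writing the first element of each pair to one file and the second to another gives a paired dataset. A smaller `step` gives more overlapping windows and therefore more coverage, but every base quality is made up, so any caller metric derived from these reads is artificial.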

Any thoughts on this?

I would remove all monomorphic calls for this sample and apply default filters, such as removing SNPs in repeat regions and SNPs near indels.
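One possible reading of "removing monomorphic calls for this sample" is to set its homozygous-reference genotypes to missing in the final VCF, so that they are not counted as real evidence. The sketch below does that on plain VCF text lines; the function name `mask_hom_ref` and this interpretation are assumptions, not an established procedure.

```python
def mask_hom_ref(vcf_lines, sample):
    """Yield VCF lines with hom-ref genotypes of `sample` set to missing.

    Works on uncompressed VCF text; header lines pass through unchanged.
    """
    col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            col = fields.index(sample)  # column of the dummy sample
            yield line
            continue
        if col is None:
            raise ValueError("no #CHROM header line before the data records")
        keys = fields[8].split(":")      # FORMAT column
        values = fields[col].split(":")
        gt_idx = keys.index("GT")
        alleles = values[gt_idx].replace("|", "/").split("/")
        if all(a == "0" for a in alleles):
            # hom-ref call: replace every allele with the missing marker
            values[gt_idx] = "/".join("." for _ in alleles)
            fields[col] = ":".join(values)
        yield "\t".join(fields)
```

Other samples are left untouched, so sites stay in the file as long as the real samples support them.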

An alternative would be to compare the reference on which the mapping will be done with this reference using MUMmer. But then I would have to integrate the calls into the VCF, and SNP calling metrics would be absent for this sample.
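For the second option, integrating the calls could look roughly like the sketch below: an extra sample column is appended to an existing VCF, with a genotype taken from the reference-to-reference comparison and every other FORMAT field left missing. The function name, the `alt_calls` dictionary of `(chrom, pos) -> alternate base` and the sample name are hypothetical, and the code does not parse MUMmer output itself. It also assumes that a site not reported as different is homozygous reference, which only holds where the two references actually align.

```python
def add_sample_column(vcf_lines, sample, alt_calls):
    """Append a genotype-only column for `sample` to uncompressed VCF text lines.

    alt_calls maps (chrom, 1-based pos) to the base seen in the other
    reference at sites where the two references differ (assumed input).
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line
            continue
        if line.startswith("#CHROM"):
            yield line + "\t" + sample
            continue
        fields = line.split("\t")
        chrom, pos, alts = fields[0], int(fields[1]), fields[4].split(",")
        called = alt_calls.get((chrom, pos))
        if called is None:
            gt = "0/0"            # assumption: no reported difference means reference
        elif called in alts:
            idx = alts.index(called) + 1
            gt = f"{idx}/{idx}"   # an assembly gives one allele, so homozygous
        else:
            gt = "./."            # differs, but not by an ALT allele of this record
        # every FORMAT key other than GT gets the VCF missing value "."
        values = [gt if key == "GT" else "." for key in fields[8].split(":")]
        yield line + "\t" + ":".join(values)
```

The missing values make the point from the paragraph above visible: depth, genotype quality and similar metrics simply do not exist for this sample, so downstream filters must not require them.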

I don't like either option very much, but I don't really see an alternative.

Thanks for any suggestions.

snp calling • 2.0k views

What's the goal of your analysis? Do you want dummy SNPs for the samples where you don't have raw data? Your question and approach are not clear.


Well, we want to know the genotype call for the sample(s) without raw data, provided that the calls from the other samples are confident enough to accept the SNP. We will then filter downstream for assay design, taking the calls for all samples into account. So I would probably not use the calls from the samples without raw data for quality filtering.

The filtering will be done on the VCF so I need the calls in there.

Removing the monomorphic calls will discard some true positives, but we're not really interested in those, so that's not a big problem, except that if they're in the flanking sequence they could hamper the assay.

I guess I have to weigh which is more work: creating the dummy FASTQ or adding the calls made by direct reference comparison to the VCF.

9.3 years ago

If you would like to create raw FASTQ files, you need to replicate the wet-lab protocol of that platform.

For example, for Illumina HiSeq, take the FASTA file of the genome and randomly fragment it into multiple chunks. Then size-select the fragments (e.g. keep those of 300-500 bp) and read 100 bp from both ends. You may ignore error rates and quality information for now. A window-based approach is not correct, as genomic DNA is randomly fragmented. Or simply use one of the existing NGS read simulator programs, which take care of all these parameters.
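The random-fragmentation recipe above can be sketched in a few lines of Python. Instead of literally shattering the genome and discarding fragments outside the size range, this version draws each fragment's length and start position at random, which is a common shortcut and not the same thing as a physical simulation. The function names, the seed and the constant quality character are assumptions, and there is no error model, in line with ignoring error rates for now.

```python
import random


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def simulate_pairs(name, seq, n_pairs, min_frag=300, max_frag=500,
                   read_len=100, seed=0, qual_char="I"):
    """Yield n_pairs (read1, read2) FASTQ records from random fragments of seq."""
    if len(seq) < max_frag:
        raise ValueError("sequence is shorter than the largest fragment size")
    rng = random.Random(seed)      # fixed seed keeps the output reproducible
    qual = qual_char * read_len    # constant dummy base qualities
    for i in range(n_pairs):
        frag_len = rng.randint(min_frag, max_frag)      # size selection
        start = rng.randint(0, len(seq) - frag_len)     # random break point
        frag = seq[start:start + frag_len]
        if rng.random() < 0.5:
            frag = reverse_complement(frag)  # fragment may come from either strand
        read1 = frag[:read_len]
        read2 = reverse_complement(frag[-read_len:])
        rid = f"{name}_{i + 1}"
        yield (f"@{rid}/1\n{read1}\n+\n{qual}\n",
               f"@{rid}/2\n{read2}\n+\n{qual}\n")
```

As a rough guide, the mean coverage is about `2 * read_len * n_pairs / len(seq)`, so `n_pairs` can be chosen from the coverage wanted. A dedicated read simulator remains the safer choice when realistic qualities or errors matter.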
