Trying to identify larger indels from NGS data (FASTQ format)
5.9 years ago

The data I have is sequencing of fragmented whole genomic DNA from several individuals of S. purpuratus (purple sea urchin), all in FASTQ format. I have an index of the genome, which I use to align the reads and produce BAM output files. I have tried to identify indels but found only a few SNPs; I am looking for much larger insertions or deletions. I think TopHat may only be detecting differences between each FASTQ file and the genome index. The differences I want to detect (large insertions and deletions) are between the FASTQ files themselves, not between each FASTQ file and the reference genome. Does anyone know of a good workflow for identifying indels by simply aligning multiple FASTQ files?


TopHat is likely not the right aligner here: it was designed for spliced alignment of RNA-seq data, and even for that purpose it is deprecated and should be replaced by, e.g., STAR.

For genomic DNA, you should probably use bwa mem.


Yes, as per Wouter, you should re-align your data using, e.g., bwa mem (if your reads are >70bp), and then you can follow my answer with the re-aligned BAMs. TopHat is a splice-aware aligner intended for RNA-seq reads.
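
For what it's worth, a minimal sketch of that re-alignment step could look like the following. It assumes paired-end reads, and the reference path, sample names, and the fastq/ and bam/ directory layout are all hypothetical placeholders. By default it only prints the commands (DRY_RUN=1); set DRY_RUN=0 to actually run them, which needs bwa and samtools installed.

```shell
#!/bin/sh
# Sketch only: REF, THREADS, the sample names, and the fastq/ and bam/
# layout are hypothetical. Adjust them to your own data.
REF="${REF:-reference/genome.fa}"
THREADS="${THREADS:-4}"

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Align one paired-end sample, then coordinate-sort and index the result.
align_sample() {
    s="$1"
    run bwa mem -t "$THREADS" -o "bam/$s.sam" "$REF" \
        "fastq/${s}_R1.fastq" "fastq/${s}_R2.fastq"
    run samtools sort -@ "$THREADS" -o "bam/$s.bam" "bam/$s.sam"
    run samtools index "bam/$s.bam"
}

run mkdir -p bam
# The bwa index only has to be built once per reference.
run bwa index "$REF"
for s in urchin1 urchin2; do
    align_sample "$s"
done
```

Each sample ends up as its own sorted, indexed BAM, which is the input the variant callers discussed in this thread expect.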

5.9 years ago

If you want a comprehensive 'indel finder', then look no further than pindel. It attempts to call insertions, deletions, inversions, duplications, and other structural variants. You will have to know your expected insert size for each sample.
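
If you do not know the insert size, one way to estimate it is from an existing paired-end BAM via samtools stats, which reports an "insert size average" line in its summary (SN) section. The helper below is only a sketch; the function name and the BAM path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Read `samtools stats` output on stdin and print the mean insert size,
# rounded to the nearest integer. Summary lines are tab-separated:
#   SN <tab> insert size average: <tab> VALUE
mean_insert_size() {
    awk -F '\t' '$1 == "SN" && $2 == "insert size average:" { printf "%.0f\n", $3 }'
}

# Usage (needs samtools and a paired-end BAM; the path is hypothetical):
#   samtools stats bam/urchin1.bam | mean_insert_size
```

The value is only meaningful for properly paired reads, so it is worth sanity-checking it against what the library prep was expected to produce.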

Some sample commands to call structural variants and then report as VCF (NB - here I only call variants over the region 5:135,402,841-135,402,862):

pindel -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -i Sample8.txt -c 5:135,402,841-135,402,862 --number_of_threads 3 -o output/Sample8

The file Sample8.txt contains the BAM, expected insert size, and ID:

BAMoutput/Sample8.bam   350 Sample8
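
Since the question involves several individuals, the same config file can hold one line per sample (BAM, insert size, label), and Pindel will then process them together. A small sketch for generating such a file follows; the sample names, insert sizes, and bam/ paths are hypothetical.

```shell
#!/bin/sh
# Print one pindel config line per sample: BAM path, insert size, label.
# Input on stdin: "<sample> <insert_size>" pairs, one per line.
make_pindel_config() {
    while read -r sample insert; do
        # Skip blank lines.
        [ -n "$sample" ] || continue
        printf '%s\t%s\t%s\n' "bam/$sample.bam" "$insert" "$sample"
    done
}

# Hypothetical samples and insert sizes; redirect to a file to use it, e.g.
#   ... | make_pindel_config > all_samples.txt
printf '%s\n' 'urchin1 350' 'urchin2 340' | make_pindel_config
```

The resulting file is passed to pindel with -i exactly as Sample8.txt is above.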

Then convert to VCF and tidy up:

pindel2vcf --pindel_output_root output/Sample8 -r /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -R 1000GenomesPilot-NCBI36 -d 20101123 --min_coverage 18 --het_cutoff 0.2 --hom_cutoff 0.8 --vcf output/Sample8.vcf
bgzip -f output/Sample8.vcf
tabix -f -p vcf output/Sample8.vcf.gz
bcftools view -Ov --exclude-uncalled --min-ac=1 output/Sample8.vcf.gz > output/Sample8.filt.vcf
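
As the question is about much larger insertions and deletions, a further size filter on the VCF may help. The sketch below uses plain awk rather than any bcftools feature and assumes each record carries an SVLEN entry in its INFO column (check your own pindel2vcf output for this); the 50 bp default threshold and the function name are arbitrary choices for illustration.

```shell
#!/bin/sh
# Keep VCF header lines plus records whose |SVLEN| is at least the threshold.
# Reads an uncompressed VCF on stdin; records without SVLEN are dropped.
filter_large_indels() {
    min_len="${1:-50}"
    awk -F '\t' -v min="$min_len" '
        /^#/ { print; next }
        {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^SVLEN=/) {
                    # "SVLEN=" is 6 characters, so the value starts at 7.
                    len = substr(info[i], 7) + 0
                    if (len < 0) len = -len
                    if (len >= min) print
                    break
                }
            }
        }'
}

# Usage (file names follow the commands above):
#   filter_large_indels 50 < output/Sample8.filt.vcf > output/Sample8.large.vcf
```

Deletions are reported with a negative SVLEN, which is why the absolute value is compared against the threshold.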

Kevin

5.0 years ago
geocarvalho ▴ 360

Another option is IMSindel, which, according to its paper, performs well compared with other tools such as Pindel.
