Question

Trying to identify larger indels from NGS data (FASTQ format)

0

5.9 years ago

alice_pieplow • 0

I have sequencing data from fragmented whole genomic DNA of several S. purpuratus (purple sea urchin) individuals, all in FASTQ format. I also have an index of the genome, which I use to align the reads and produce BAM files. I have tried to identify indels, but only found a few SNPs. I am looking for much larger insertions or deletions. I think TopHat may only be detecting differences between each FASTQ file and the genome index. The differences I want to detect (large insertions and deletions) are between the FASTQ files themselves, not between each FASTQ file and the reference genome. Does anyone know of a good workflow to identify indels by simply aligning multiple FASTQ files?

RAD-seq DNA NGS FASTA FASTq • 2.3k views

ADD COMMENT • link updated 5.0 years ago by geocarvalho ▴ 360 • written 5.9 years ago by alice_pieplow • 0

0

TopHat is likely not the right aligner here: it was designed for spliced alignment of RNA-seq data, and even for that purpose it is deprecated and should be replaced by, e.g., STAR.

For genomic DNA you should probably use bwa mem.

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k

0

Yes, as Wouter says, you should re-align your data using, e.g., bwa mem (if your reads are >70 bp); you can then follow my answer with the aligned BAMs. TopHat is a splice-aware aligner intended for RNA-seq reads.
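As a rough illustration of the re-alignment step suggested above, here is a minimal sketch that prints (rather than runs) a bwa mem and samtools pipeline for one sample. It assumes paired-end reads named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz and a reference already indexed with `bwa index`; the function name, file names, and thread count are hypothetical and not from the thread.

```shell
# Print the commands for aligning one sample, so they can be inspected first.
# Assumptions (not from the thread): paired-end reads named
# <sample>_R1.fastq.gz / <sample>_R2.fastq.gz, reference indexed with `bwa index`.
print_align_cmds() {
    local ref="$1" sample="$2"
    # -R adds a read group, so each individual carries its own SM tag in the BAM.
    printf 'bwa mem -t 4 -R "@RG\\tID:%s\\tSM:%s" %s %s_R1.fastq.gz %s_R2.fastq.gz | samtools sort -o %s.bam -\n' \
        "$sample" "$sample" "$ref" "$sample" "$sample" "$sample"
    # Index the sorted BAM so downstream tools can access it by region.
    printf 'samtools index %s.bam\n' "$sample"
}
```

Calling `print_align_cmds genome.fa urchin1` prints two command lines; once they look right for your data, the output can be piped to `sh` to run them.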

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2018-05-14

If you want a comprehensive 'indel finder', then look no further than Pindel. It attempts to call insertions, deletions, inversions, duplications, and other structural variants. You will need to know the expected insert size for each sample.
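If the expected insert size is not known, one possible way to estimate it is from an existing paired-end alignment. The sketch below is not part of the original answer: it assumes samtools is installed and parses the "insert size average" summary line that `samtools stats` prints. The helper name `mean_insert_size` is hypothetical.

```shell
# Read `samtools stats` output on stdin and print the mean insert size,
# rounded to the nearest integer.
mean_insert_size() {
    awk -F'\t' '$1 == "SN" && $2 == "insert size average:" { printf "%d\n", $3 + 0.5 }'
}

# Intended use (requires samtools and a paired-end BAM):
#   samtools stats BAMoutput/Sample8.bam | mean_insert_size
```

The helper only reads text from standard input, so it can be checked on a saved stats file before being used in a pipeline.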

Some sample commands to call structural variants and then report as VCF (NB - here I only call variants over the region 5:135,402,841-135,402,862):

pindel -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -i Sample8.txt -c 5:135,402,841-135,402,862 --number_of_threads 3 -o output/Sample8

The file Sample8.txt contains the BAM, expected insert size, and ID:

BAMoutput/Sample8.bam   350 Sample8
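Since the question involves several individuals, the same configuration file can list one BAM per line, so that Pindel processes the samples together. A minimal sketch for generating such a file follows; it assumes every library has the same expected insert size and takes each sample label from the BAM file name. The function name and the output file name are hypothetical.

```shell
# Write one Pindel config line per BAM: <bam path> <insert size> <sample label>.
# Assumption: all libraries share the same expected insert size.
make_pindel_config() {
    local insert_size="$1" bam label
    shift
    for bam in "$@"; do
        # Use the file name without its .bam suffix as the sample label.
        label="$(basename "$bam" .bam)"
        printf '%s\t%s\t%s\n' "$bam" "$insert_size" "$label"
    done
}
```

For example, `make_pindel_config 350 BAMoutput/*.bam > all_samples.txt` would write one line per BAM in that directory, and the resulting file can then be passed to pindel with -i.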

Then convert to VCF and tidy up:

pindel2vcf --pindel_output_root output/Sample8 -r /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -R 1000GenomesPilot-NCBI36 -d 20101123 --min_coverage 18 --het_cutoff 0.2 --hom_cutoff 0.8 --vcf output/Sample8.vcf
bgzip -f output/Sample8.vcf
tabix -f -p vcf output/Sample8.vcf.gz
bcftools view -Ov --exclude-uncalled --min-ac=1 output/Sample8.vcf.gz > output/Sample8.filt.vcf

Kevin

score 0 · Answer 2 · 2019-04-18

5.0 years ago

geocarvalho ▴ 360

Another option is IMSindel, which, according to its paper, performs well compared with other tools such as Pindel.

ADD COMMENT • link 5.0 years ago by geocarvalho ▴ 360