Trying to identify larger indels from NGS data (FASTQ format)
2.1 years ago

The data I have is sequencing of fragmented whole-genome DNA from several individuals of S. purpuratus (purple sea urchin), all in FASTQ format. I have an index of the genome to align against, and I can produce .bam output files. I have tried to identify indels, but only found a few SNPs, whereas I am looking for much larger insertions or deletions. I think that TopHat might only be detecting differences between each FASTQ file and the genome index. The differences I want to detect (large insertions and deletions) are between the FASTQ files, i.e. between individuals, not between each FASTQ file and the reference genome. Does anyone know of a good workflow to identify indels by simply aligning multiple FASTQ files?

TopHat is probably not the right aligner here: it was designed for spliced alignment of RNA-seq data, and even for that purpose it is deprecated and should be replaced by, e.g., STAR.

For genomic DNA you should probably use bwa mem.

Yes, as Wouter says, you should re-align your data using, e.g., bwa mem (if your reads are >70 bp); you could then use my answer with your aligned BAMs. TopHat is a splice-aware aligner intended for RNA-seq reads.
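
A minimal sketch of that re-alignment step, assuming paired-end reads, a reference FASTA that has already been indexed with bwa index, and bwa and samtools on the PATH. The function name align_sample, the thread counts, and all file names are hypothetical:

```shell
# Align one individual's paired-end reads and write a sorted, indexed BAM.
# Arguments: reference FASTA, sample name, R1 FASTQ, R2 FASTQ.
align_sample() {
    local ref=$1 sample=$2 r1=$3 r2=$4
    # -R adds a read group so that each BAM carries its sample name.
    bwa mem -t 4 -R "@RG\tID:${sample}\tSM:${sample}" "$ref" "$r1" "$r2" \
        | samtools sort -@ 2 -o "${sample}.bam" -
    samtools index "${sample}.bam"
}

# Example call (hypothetical file names):
# align_sample Spur_genome.fa urchin1 urchin1_R1.fastq.gz urchin1_R2.fastq.gz
```

Running this once per individual gives one BAM per sample, which is the input that the variant callers discussed in this thread expect.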

11 months ago
Republic of Ireland

If you want a comprehensive 'indel finder', then look no further than pindel. It attempts to call insertions, deletions, inversions, duplications, and other structural variants. You will have to know your expected insert size for each sample.
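
If the expected insert size is not known from the library preparation, one way to estimate it, assuming paired-end reads that are already aligned, is to read the "insert size average" summary line that samtools stats reports. The helper name insert_size is hypothetical:

```shell
# Print the mean insert size of a BAM, rounded to the nearest integer.
# samtools stats writes tab-separated summary lines that start with "SN".
insert_size() {
    samtools stats "$1" \
        | awk -F'\t' '$1 == "SN" && $2 == "insert size average:" { printf "%.0f\n", $3 }'
}

# Example call (hypothetical file name):
# insert_size BAMoutput/Sample8.bam
```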

Some sample commands to call structural variants and then report as VCF (NB - here I only call variants over the region 5:135,402,841-135,402,862):

pindel -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -i Sample8.txt -c 5:135,402,841-135,402,862 --number_of_threads 3 -o output/Sample8

The file Sample8.txt contains the BAM, expected insert size, and ID:

BAMoutput/Sample8.bam   350 Sample8
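
As far as I know, the same file can list several BAMs, one per line, so that all individuals are called together. A small sketch for generating such a file, assuming every library shares the same expected insert size and that the sample ID can be taken from the BAM file name (both are assumptions, and pindel_config is a hypothetical helper):

```shell
# Print one pindel config line per BAM: path, expected insert size, sample ID.
# The sample ID is the BAM file name without its .bam extension.
pindel_config() {
    local insert_size=$1; shift
    local bam
    for bam in "$@"; do
        printf '%s\t%s\t%s\n' "$bam" "$insert_size" "$(basename "$bam" .bam)"
    done
}

# Example call (hypothetical file names):
# pindel_config 350 BAMoutput/urchin1.bam BAMoutput/urchin2.bam > all_samples.txt
```

If the libraries differ in insert size, the file is better written by hand, or the function called once per sample with the output appended.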

Then convert to VCF and tidy up:

pindel2vcf --pindel_output_root output/Sample8 -r /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta -R 1000GenomesPilot-NCBI36 -d 20101123 --min_coverage 18 --het_cutoff 0.2 --hom_cutoff 0.8 --vcf output/Sample8.vcf
bgzip -f output/Sample8.vcf
tabix -f -p vcf output/Sample8.vcf.gz
bcftools view -Ov --exclude-uncalled --min-ac=1 output/Sample8.vcf.gz > output/Sample8.filt.vcf

Kevin
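
Since the original question is about differences between individuals rather than against the reference, one possible follow-up is to keep only the larger events whose genotypes differ between samples. This is only a sketch: it assumes all samples were called together into a single multi-sample VCF, that each record carries the SVLEN INFO tag written by pindel2vcf, and that GT is the first FORMAT field. The function name, the 50 bp threshold, and the file names are hypothetical:

```shell
# Keep VCF records whose absolute INFO/SVLEN is at least min_len and whose
# called genotypes are not identical across all samples.
# Reads an uncompressed VCF on stdin and writes the filtered VCF to stdout.
variable_large_indels() {
    local min_len=${1:-50}
    awk -F'\t' -v min="$min_len" '
        /^#/ { print; next }                     # pass header lines through
        {
            # Pull SVLEN out of the INFO column (field 8).
            len = ""
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^SVLEN=/) len = substr(info[i], 7)
            if (len == "") next                  # no SVLEN tag: skip record
            len += 0
            if (len < 0) len = -len
            if (len < min) next

            # Count the distinct called genotypes in the sample columns.
            split("", seen); kinds = 0
            for (s = 10; s <= NF; s++) {
                split($s, fmt, ":")
                gt = fmt[1]
                gsub(/\|/, "/", gt)              # treat phased and unphased alike
                if (gt ~ /\./) continue          # ignore uncalled genotypes
                if (!(gt in seen)) { seen[gt] = 1; kinds++ }
            }
            if (kinds > 1) print
        }'
}

# Example call (hypothetical file names):
# variable_large_indels 50 < output/all_samples.vcf > output/all_samples.variable.vcf
```

Records that are homozygous for the same allele in every individual are dropped, so what remains are candidate insertions and deletions that vary within the sample set.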

12 months ago
geocarvalho • 110
Brazil/Recife

Another option is IMSindel, which, according to its paper, performs well compared with other tools such as Pindel.
