Question

How To Find Missing Reads When A Sam File Result From An Extremely Big Fastq File Is Broken

0

Entering edit mode

10.4 years ago

dustar1986 ▴ 380

I'm afraid it's a silly question but please help.

I've got an extremely paried-end bisulfite-sequencing file (more than 1TB). I used bismark to map it to mm10. Unfortunately, the bismark finished running with a broken SAM file (the last line is incomplete).

It's quite a trouble to run the mapper again coz it cost about 3 weeks. I've checked that about 20% reads were not written into the SAM file. So now I have to find which reads are missing and map that small part again. The difficult part is that, reads in the SAM file are not as the same order as those in the FASTQ file.

What is the fastest way to compare the read IDs in the broken SAM file with IDs in the original FASTQ file? Or is there an alternative way to achieve it?

Really appreciate if anyone can help.

sam bowtie • 3.2k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 10.4 years ago by dustar1986 ▴ 380

0

Entering edit mode

I assume FASTQ file has sorted IDs. Now convert your SAM into BAM and sort it using queryname. This way you can kind of make out where in the fastq file aligner went wrong.

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k

Ram · Answer 1 · 2013-11-22

Bismark will always output reads in the same order as they were input. What you're seeing as reads being output in a different order is actually just missing alignments due to reads not aligning with the bismark-set parameters (bismark is rather conservative). To get the last chunk of the fastq files, just samtools view alignment.bam | tail | cut -f 1 to get the last read name (e.g. HWI-ST568:174:D0LB2ACXX:8:1101:11539:2429), and then (for each of the fastq files):

zcat read_1.fastq.gz | \
  awk '
    BEGIN {found=0}
    {
    if(found==0) { 
      if(/HWI\-ST568:174:D0LB2ACXX:8:1101:11539:2429/) {found=1}
    }
    if(found==1) print $0
    }
  ' | \
  gzip -c > remainder_1.fastq.gz

Note that you have to escape (e.g., \-)any minus signs in the read name.

Edit: As an aside, if you have a bit of computational resources and are comfortable compiling things yourself, you might benefit from using bison rather than bismark. It's both faster (it can be 2-3x faster when given the same resources) and produces slightly more accurate results (though the difference is quite small in terms of accuracy). I hope to release an updated version early next-week that can also use an semi-arbitrary number of cluster nodes, in case you have a decent sized cluster (I'm working out the last bugs at the moment).

Ram · Answer 2 · 2014-09-07

1

Entering edit mode

9.6 years ago

Biomonika (Noolean) 3.2k

Just get two files - ids from fastq file and ids from SAM file. Then you can use comm command, e.g.

comm -13 file1 file2

(see man comm)

This way you can find out which reads are missing, extract them (e.g. with seqtk subseq) and map again.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Biomonika (Noolean) 3.2k