How To Find Missing Reads When A Sam File Result From An Extremely Big Fastq File Is Broken
2
0
Entering edit mode
10.4 years ago
dustar1986 ▴ 380

I'm afraid it's a silly question but please help.

I've got an extremely paried-end bisulfite-sequencing file (more than 1TB). I used bismark to map it to mm10. Unfortunately, the bismark finished running with a broken SAM file (the last line is incomplete).

It's quite a trouble to run the mapper again coz it cost about 3 weeks. I've checked that about 20% reads were not written into the SAM file. So now I have to find which reads are missing and map that small part again. The difficult part is that, reads in the SAM file are not as the same order as those in the FASTQ file.

What is the fastest way to compare the read IDs in the broken SAM file with IDs in the original FASTQ file? Or is there an alternative way to achieve it?

Really appreciate if anyone can help.

sam bowtie • 3.2k views
ADD COMMENT
0
Entering edit mode

I assume FASTQ file has sorted IDs. Now convert your SAM into BAM and sort it using queryname. This way you can kind of make out where in the fastq file aligner went wrong.

ADD REPLY
1
Entering edit mode
10.4 years ago

Bismark will always output reads in the same order as they were input. What you're seeing as reads being output in a different order is actually just missing alignments due to reads not aligning with the bismark-set parameters (bismark is rather conservative). To get the last chunk of the fastq files, just samtools view alignment.bam | tail | cut -f 1 to get the last read name (e.g. HWI-ST568:174:D0LB2ACXX:8:1101:11539:2429), and then (for each of the fastq files):

zcat read_1.fastq.gz | \
  awk '
    BEGIN {found=0}
    {
    if(found==0) { 
      if(/HWI\-ST568:174:D0LB2ACXX:8:1101:11539:2429/) {found=1}
    }
    if(found==1) print $0
    }
  ' | \
  gzip -c > remainder_1.fastq.gz

Note that you have to escape (e.g., \-)any minus signs in the read name.

Edit: As an aside, if you have a bit of computational resources and are comfortable compiling things yourself, you might benefit from using bison rather than bismark. It's both faster (it can be 2-3x faster when given the same resources) and produces slightly more accurate results (though the difference is quite small in terms of accuracy). I hope to release an updated version early next-week that can also use an semi-arbitrary number of cluster nodes, in case you have a decent sized cluster (I'm working out the last bugs at the moment).

ADD COMMENT
0
Entering edit mode

Thanks Dpryan, this is quite helpful.

I'd really like to try Bison when I get rid of the current issue. Thanks again.

ADD REPLY
1
Entering edit mode
9.6 years ago

Just get two files - ids from fastq file and ids from SAM file. Then you can use comm command, e.g.

comm -13 file1 file2

(see man comm)

This way you can find out which reads are missing, extract them (e.g. with seqtk subseq) and map again.

ADD COMMENT

Login before adding your answer.

Traffic: 1441 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6