Biostar Beta. Not for public use.
Question: Does BWA mem ever discard or modify data
0
Entering edit mode

I could not find a definite answer to this question. Does BWA mem modify the data it receives? I understand that it adds data (alignment of course, and more), but am wondering if the data in the fastq is guaranteed to survive.

The program takes in read names, a sequence string, and a quality string. Does it ever modify any of those?

The reason I ask is I am looking to compress our fastq data by running BWA, discarding as many unneeded sam columns and tags, and compressing to CRAM. I have run various tests, and it seems to be working fine, but wanted to know for sure.

[ed: clarify question and goals]

ADD COMMENTlink 20 months ago e.benn • 110
Entering edit mode
2

It is not the right question - BWA produces a SAM format - so the question should be "Is the sequence information in a SAM format identical to the FASTQ sequence" - to which the answer is NO. The SAM format represents information on the forward strand, hence sequences mapping on the reverse strand will be reverse complemented in the SAM format.

Perhaps the question is whether the FASTQ information could always be reconstructed from the SAM file - to which is the answer is again that not always. If the alignment is reported hard clipped then the clipped sequence will not be in the SAM file.

ADD REPLYlink 20 months ago
Istvan Albert
80k
Entering edit mode
1

running BWA, discarding all but the name, sequence and quality

So why don't you archive fastq in the first place? I don't see the reason to align and then discard the work the aligner has done. If you really want to store unaligned CRAM instead of fastq then you can convert fastq to unaligned SAM with e.g. picard/FastqToSam.

ADD REPLYlink 20 months ago
dariober
10k
Entering edit mode
0

discarding all but the name, sequence and quality, and compressing with CRAM

In fact, he can't do what he wants, because CRAM is an alignment format, and the compression is based on the alignment to the reference genome. If he discards everything but the name, sequence and quality, he can't save as CRAM.

ADD REPLYlink 20 months ago
h.mon
25k
Entering edit mode
0

Yes very good, I also need to keep the alignment, and the 'which pair-end' metadata, and probably some more I have forgotten.

ADD REPLYlink 20 months ago
e.benn
• 110
3
Entering edit mode

I believe you can keep all information from a fastq file if you use bwa mem with default parameters. The resulting sam will contain the unmapped reads, and only one copy of the mapped reads will contain the "primary" flag - additional mappings will be marked as "secondary". By default, bwa does soft clipping of primary reads, so all sequence and quality is kept. And even if original fastq read names are edited - removing /1 and /2, or whatever other convention used - the reads will contain the "first in pair" and "second in pair" flags, so read names can be recovered in full.

A great tool to check my claim is BamHash.

edit: this is not to say storing reads as CRAM files is a good idea, but some people think for long-term storage indeed it is. See CRAM vs BAM for references, and for some benchmarks:

https://www.uppmax.uu.se/support/user-guides/using-cram-to-compress-bam-files/

http://www.htslib.org/benchmarks/CRAM.html

ADD COMMENTlink 20 months ago h.mon 25k
5
Entering edit mode

What you might be looking for is uBAM - This is used in GATK's production pipeline to attach metadata to samples as early as possible, however the Broad's scale of data that they're analysing is far more than most places. The biggest caveat to this approach is outlined by Brian Bushnell (Author of bbmap), on that thread, in that gzipped fastq is much more practical in terms of (de)compression efficiency and resulting storage footprint, not to mention that by forcing your reads to the SAM spec, you may be losing original read names, and that can cause problems with tools that assume the original illumina read names (ie, paired end data could be parsed as single end).

I was thinking along the same lines as you a few years ago, and opted against uBAM and stuck with gzipped fastq, however saying that, I recall a thread about loseless fastq compression here. The tool (Alapy) that @Petr Ponomarenko outlined is available here, and there's a fantastic discussion on the thread of benchmarking and feature requests, worth a read.

Additional methodologies out there are binary fastq as @chris.bird mentioned:

  • @John's uQ tool
  • @Brian Bushnell's Clumpify tool (from bbmap)

Overall, my advice is to avoid uBAM, and check out lossless compression methods like alapy, uQ, or Clumpify, and see what best fits your needs, as there will be tradeoffs in terms of memory usage, and time to execute per file.

ADD COMMENTlink 20 months ago andrew.j.skelton73 5.7k
Entering edit mode
0

Thanks Andrew for that detailed answer, I will have a close look at all this

ADD REPLYlink 20 months ago
e.benn
• 110
Entering edit mode
0

No matter what compressed storage format you use, the majority of the entropy in sequence data is actually in base quality scores, not the actual sequences. If you want 50%+ improvement over fq.gz then you'll be forced to discard information - base quality score binning will make a bigger difference to your storage footprint that your choice between fq.gz, bam and cram. Whether this level of loss is acceptable depend on your requirements.

ADD REPLYlink 20 months ago
d-cameron
♦ 2.0k
0
Entering edit mode

it can clip bases, you'll want to relax the defaults

reads could also be filtered if they don't map to the reference genome

given that a bam file has more info than a fastq file, this probably won't work

search on binary fastq for a better solution

ADD COMMENTlink 20 months ago chris.bird • 0
Entering edit mode
0

it can clip bases, you'll want to relax the defaults

The bases are soft clipped so the information is still there. Only supplementary alignments are hard clipped but the information is in the primary alignment anyway.

reads could also be filtered if they don't map to the reference genome

Unmapped reads are retained in the bwa output. They just have the read unmapped SAM flag set.

given that a bam file has more info than a fastq file, this probably won't work

bwa alignment does not remove any data from the input fastq. Even the record ordering is retain in the bwa output.

ADD REPLYlink 20 months ago
d-cameron
♦ 2.0k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0