Forum:BWA command guide

6.1 years ago

Chen Sun ★ 1.1k

SYNOPSIS

bwa index ref.fa
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
bwa bwasw ref.fa long_read.fq > aln.sam

COMMANDS AND OPTIONS

index
bwa index [-p prefix] [-a algoType] <in.db.fasta>

Index database sequences in the FASTA format.

OPTIONS:

-p STR
Prefix of the output database [same as db filename]

-a STR
Algorithm for constructing BWT index. Available options are:

is
IS linear-time algorithm for constructing the suffix array. It requires 5.37N memory, where N is the size of the database. IS is moderately fast, but does not work with databases larger than 2GB. IS is the default algorithm due to its simplicity. The current code for the IS algorithm was reimplemented by Yuta Mori.

bwtsw
Algorithm implemented in BWT-SW. This method works with the whole human genome.
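
The choice between the two -a values described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, "5.37N" is assumed to mean bytes, and "2GB" is assumed to mean 2 * 10^9 bases. None of it is part of BWA itself.

```python
# Hypothetical helpers for illustration; not part of BWA.
IS_BYTES_PER_BASE = 5.37         # "5.37N memory" from the text (assumed to be bytes)
IS_MAX_DB_SIZE = 2_000_000_000   # assumption: "2GB" read as 2e9 bases


def estimate_is_memory_bytes(db_size: int) -> float:
    """Approximate memory used by the IS algorithm for a database of db_size bases."""
    return IS_BYTES_PER_BASE * db_size


def choose_index_algorithm(db_size: int) -> str:
    """Return a value for -a: 'is' for small databases, 'bwtsw' otherwise."""
    return "is" if db_size < IS_MAX_DB_SIZE else "bwtsw"


if __name__ == "__main__":
    for name, size in [("small genome", 5_000_000), ("human genome", 3_100_000_000)]:
        algo = choose_index_algorithm(size)
        print(f"{name}: bwa index -a {algo} ref.fa")
```

With these assumptions, a 5 Mbp database selects is and a 3.1 Gbp database selects bwtsw.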

mem
bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty] [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R RGline] [-v verboseLevel] db.prefix reads.fq [mates.fq]

Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).

If the mates.fq file is absent and option -p is not set, this command treats the input reads as single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th reads in reads.fq constitute a read pair (such an input file is said to be interleaved). In this case, mates.fq is ignored. In paired-end mode, the mem command infers the read orientation and the insert size distribution from a batch of reads.
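
The two pairing conventions described above can be illustrated with a small sketch. The function names are hypothetical, reads are represented as plain strings, and the length checks are a choice made for this example rather than documented BWA behaviour.

```python
from typing import List, Sequence, Tuple


def pair_separate(reads: Sequence[str], mates: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the i-th read in reads.fq with the i-th read in mates.fq."""
    if len(reads) != len(mates):
        raise ValueError("reads and mates must contain the same number of records")
    return list(zip(reads, mates))


def pair_interleaved(reads: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair the 2i-th and (2i+1)-th reads of an interleaved file (i starts at 0)."""
    if len(reads) % 2 != 0:
        raise ValueError("an interleaved file must contain an even number of records")
    return [(reads[2 * i], reads[2 * i + 1]) for i in range(len(reads) // 2)]


if __name__ == "__main__":
    print(pair_separate(["a/1", "b/1"], ["a/2", "b/2"]))
    print(pair_interleaved(["a/1", "a/2", "b/1", "b/2"]))
```

Both calls yield the same pairs, which is the point: -p only changes where the mate is read from.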

The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different parts of a query sequence. This is a crucial feature for long sequences. However, some tools, such as Picard’s markDuplicates, do not work with split alignments. Consider using option -M to flag shorter split hits as secondary.

OPTIONS:

-t INT
Number of threads [1]

-k INT
Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it deviates significantly from 20. [19]

-w INT
Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]

-d INT
Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST’s X-dropoff except that it doesn’t penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
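
The Z-dropoff stopping rule above can be written out as a one-line test. This is a minimal sketch of the stated inequality only; the function name is hypothetical and the defaults are the bracketed defaults of -A and -d.

```python
def zdrop_should_stop(best_score: int, current_score: int, i: int, j: int,
                      match_score: int = 1, zdrop: int = 100) -> bool:
    """Return True when best - current is above |i - j| * A + INT.

    i and j are the current query and reference positions. The |i - j| * A
    term is what keeps a gap in one sequence from counting against the
    extension.
    """
    return best_score - current_score > abs(i - j) * match_score + zdrop


if __name__ == "__main__":
    # On the diagonal (i == j) a drop of 150 exceeds the threshold of 100.
    print(zdrop_should_stop(200, 50, 300, 300))
    # 60 positions off the diagonal the threshold rises to 160, so extension continues.
    print(zdrop_should_stop(200, 50, 300, 240))
```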

-r FLOAT
Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. Larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
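
As a worked example of the threshold above, with the defaults -k 19 and -r 1.5 a MEM must be longer than 28.5 bp to trigger re-seeding. The sketch below follows the wording of the text literally; the function names are hypothetical, and the actual implementation may round the threshold differently.

```python
def reseed_threshold(min_seed_len: int = 19, ratio: float = 1.5) -> float:
    """Length above which a MEM triggers re-seeding: minSeedLen * FLOAT."""
    return min_seed_len * ratio


def triggers_reseeding(mem_len: int, min_seed_len: int = 19, ratio: float = 1.5) -> bool:
    """True if a MEM of mem_len bases is longer than the re-seeding threshold."""
    return mem_len > reseed_threshold(min_seed_len, ratio)


if __name__ == "__main__":
    print(reseed_threshold())         # default threshold
    print(reseed_threshold(19, 3.0))  # a larger ratio means fewer MEMs are re-seeded
```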

-c INT
Discard a MEM if it has more than INT occurrences in the genome. This is an insensitive parameter. [10000]

-P
In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.

-A INT
Matching score. [1]

-B INT
Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
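
The error-rate approximation above is easy to evaluate numerically. The sketch below only restates the formula from the text; the function name is hypothetical. With the defaults B=4 and A=1 it evaluates to 0.75 / 4^4, roughly 0.3%.

```python
import math


def approx_error_rate(mismatch_penalty: float = 4, match_score: float = 1) -> float:
    """Evaluate .75 * exp[-log(4) * B/A] for mismatch penalty B and match score A."""
    return 0.75 * math.exp(-math.log(4) * mismatch_penalty / match_score)


if __name__ == "__main__":
    for b in (2, 4, 6):
        print(f"B={b}, A=1: approximate error rate {approx_error_rate(b, 1):.6f}")
```

Lowering B relative to A corresponds to a higher assumed error rate, which is why a smaller mismatch penalty suits noisier reads.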

-O INT
Gap open penalty. [6]

-E INT
Gap extension penalty. A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]
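
The affine gap cost above can be written as a small function. This is a sketch of the stated formula only; the function name is hypothetical and the defaults are the bracketed defaults of -O and -E.

```python
def gap_cost(length: int, gap_open: int = 6, gap_extend: int = 1) -> int:
    """Cost of a gap of the given length: O + k*E."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    return gap_open + length * gap_extend


if __name__ == "__main__":
    for k in (0, 1, 10):
        print(f"gap of length {k}: cost {gap_cost(k)}")
```

Under the defaults a single-base gap therefore costs 7, not 6, because -O alone pays for a zero-length gap.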

-L INT
Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deducted. [5]
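
The clipping decision above amounts to a single comparison. The sketch below restates it under the assumption that both scores are already known; the function name is hypothetical and the default is the bracketed default of -L.

```python
def clipping_not_applied(score_to_query_end: int, best_sw_score: int,
                         clip_penalty: int = 5) -> bool:
    """True when the alignment is kept through to the end of the query.

    Per the text, clipping is skipped if the best score reaching the end of
    the query is larger than the best SW score minus the clipping penalty.
    """
    return score_to_query_end > best_sw_score - clip_penalty


if __name__ == "__main__":
    print(clipping_not_applied(96, 100))  # 96 > 95: no clipping
    print(clipping_not_applied(90, 100))  # 90 <= 95: the read end is clipped
```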

-U INT
Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired one as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. [9]
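
The comparison above can be sketched as follows. The text does not say how insertPenalty is computed or how ties are resolved, so here it is passed in by the caller and a tie is treated as "do not force pairing"; both are assumptions of this example, and the function names are hypothetical.

```python
def unpaired_score(score1: int, score2: int, unpair_penalty: int = 9) -> int:
    """scoreRead1 + scoreRead2 - INT, where INT is the -U penalty."""
    return score1 + score2 - unpair_penalty


def paired_score(score1: int, score2: int, insert_penalty: int) -> int:
    """scoreRead1 + scoreRead2 - insertPenalty (insertPenalty supplied by the caller)."""
    return score1 + score2 - insert_penalty


def prefer_pairing(score1: int, score2: int, insert_penalty: int,
                   unpair_penalty: int = 9) -> bool:
    """True if the paired score beats the unpaired score (ties: assumption, not paired)."""
    return paired_score(score1, score2, insert_penalty) > unpaired_score(
        score1, score2, unpair_penalty)


if __name__ == "__main__":
    print(prefer_pairing(60, 60, insert_penalty=3))   # small insert penalty
    print(prefer_pairing(60, 60, insert_penalty=20))  # large insert penalty
```

Raising -U makes the unpaired score lower, so pairing is preferred even for larger insert penalties.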

-p
Assume the first input query file is interleaved paired-end FASTA/Q. See the command description for details.

-R STR
Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
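
When bwa mem is launched from a script, the read group string can be assembled programmatically. The sketch below is one way to do so; the helper names are hypothetical, and the string uses a literal backslash followed by t, which the text above says is converted to a TAB in the output SAM. Nothing is executed here: the functions only build the argument list.

```python
from typing import Dict, List, Sequence


def build_rg_line(rg_id: str, fields: Dict[str, str]) -> str:
    """Build an @RG line for -R, joining fields with a literal backslash-t."""
    parts = ["@RG", f"ID:{rg_id}"] + [f"{key}:{value}" for key, value in fields.items()]
    return "\\t".join(parts)


def mem_argv(ref: str, reads: Sequence[str], rg_line: str) -> List[str]:
    """Argument list for bwa mem with a read group; suitable for subprocess.run."""
    return ["bwa", "mem", "-R", rg_line, ref, *reads]


if __name__ == "__main__":
    rg = build_rg_line("foo", {"SM": "bar"})
    print(rg)
    print(mem_argv("ref.fa", ["read1.fq", "read2.fq"], rg))
```

Passing the arguments as a list avoids shell quoting problems with the backslashes.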

-T INT
Don’t output alignments with a score lower than INT. This option only affects output. [30]

-a
Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.

-C
Append the FASTA/Q comment to the SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform to the SAM spec (e.g. BC:Z:CGTAC). Malformed comments lead to incorrect SAM output.
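
Before using -C it can help to check that the read comments look like SAM optional fields. The sketch below is a simplified check, not a full validator: it only tests the TAG:TYPE:VALUE shape (two-character tag, one of the type codes A, i, f, Z, H, B) for TAB-separated fields and does not validate the value against its type. The function names are hypothetical.

```python
import re

# Simplified shape of a SAM optional field: TAG:TYPE:VALUE
_SAM_FIELD = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:[ -~]+$")


def fastq_comment(header_line: str) -> str:
    """Return the string after the first space in a FASTA/Q header line, or ''."""
    _, _, comment = header_line.rstrip("\n").partition(" ")
    return comment


def comment_looks_like_sam_tags(comment: str) -> bool:
    """True if every TAB-separated field has the TAG:TYPE:VALUE shape."""
    if not comment:
        return False
    return all(_SAM_FIELD.match(field) is not None for field in comment.split("\t"))


if __name__ == "__main__":
    for header in ("@read1 BC:Z:CGTAC", "@read2 1:N:0:CGTAC"):
        comment = fastq_comment(header)
        print(header, "->", comment_looks_like_sam_tags(comment))
```

The second header carries a comment in the style some sequencers emit, which does not have this shape and would produce malformed SAM if appended as-is.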

-H
Use hard clipping ’H’ in the SAM output. This option may dramatically reduce the redundancy of output when mapping long contig or BAC sequences.

-M
Mark shorter split hits as secondary (for Picard compatibility).

-v INT
Control the verbosity level of the output. This option is not yet fully supported throughout BWA. Ideally, a value of 0 disables all output to stderr; 1 outputs errors only; 2 outputs warnings and errors; 3 outputs all normal messages; 4 or higher is for debugging. When this option takes the value 4, the output is not SAM. [3]

BWA • 39k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 6.1 years ago by Chen Sun ★ 1.1k

It is also worth noting that if you have access to GPUs (A10, A30, A40, A100, A6000), you can run BWA-MEM much faster on them with the Parabricks fq2bam tool:

$ docker run \
    --gpus all \
    --rm \
    --volume /host/data:/input_data \
    --volume /host/results:/outputdir \
    --workdir /input_data \
    nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
    pbrun fq2bam \
    --ref /input_data/Homo_sapiens_assembly38.fasta \
    --in-fq /input_data/fastq1.gz /input_data/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam