Biostar Beta. Not for public use.
Efficiently extract alignments with NH:i:1 field
1
Entering edit mode
15 months ago
rubic • 180
United States

Pretty basic question.

I have bam files in which the only information that indicates whether an alignment is unique is it's NH field. I'm interested in keeping only unique alignments, hence only alignments with NH:i:1 field. Can samtools do this? If not is there any other efficient way other than using grep?

Thanks

ADD COMMENTlink
0
Entering edit mode

Grep is pretty darn efficient. I can't think of a faster way to look at each read than to look at each read. Fundamentally any alternative would do the same.

ADD REPLYlink
0
Entering edit mode

Would grep really be as efficient as using samtools view if that was an option?

ADD REPLYlink
0
Entering edit mode

I meant to run samtools view and pipe through grep. If samtools has a filter flag for your tag of interest that would be a tiny bit faster. If your tag of interest is obscure then you've got to grep it. grep has a non-regex fixed mode that is very fast, but nothing is going to read a BAM faster than samtools view.

ADD REPLYlink
0
Entering edit mode

let's assume my aligner writes the sam output to standard output. Is there a one-liner that would pipe and grep and this directly into a bam file (including the header)?

For example, if to pipe my alignment output to bam I use:

align | samtools view -Sb - >

How can I change it to pipe only alignments with NH = 1?

ADD REPLYlink
1
Entering edit mode

This should work, but there are probably faster ways...

align <fastq | grep -P '^@|NH:i:1\b' | samtools view -Sb - > <bam>
ADD REPLYlink
0
Entering edit mode

Thanks thackl. It is indeed pretty slow. If anyone is following and knows a samtools command which is faster for achieving this that'll be great.

ADD REPLYlink
2
Entering edit mode

If you want bam2bam the following grep is about 10x faster than my previous idea. Roughly 1 minute / GB bam on my system. Depending on what type of alignments are in your bam, the samtools filter flags (no unmapped, only primary, no supplement - these cannot be unique mappings anyway) might speed up things as well.

(samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam
ADD REPLYlink
1
Entering edit mode
Pardon my ignorance in linux. Is the full command you're suggesting this:


align <fastq> | samtools view -Sb (samtools view -H in.bam; samtools view -F 2308 in.bam | grep -w 'NH:i:1') | samtools view -bS - > out.bam 
ADD REPLYlink
2
Entering edit mode
14 months ago
United States

No idea about how this compares speed-wise, but python is a bit more flexible and easier to use when filtering flags, I find. The pysam packages uses a wrapper around samtools to read in and write out BAM files.

I wrote a script once for checking all kinds of flags, tried to strip it down to what you might be interested in (haven't tested it though)

#!/usr/bin/env python

import sys
import argparse
import getopt
import pysam
import os

def get_args():
    parser=argparse.ArgumentParser(description='filter BAM files')
    parser.add_argument('--BAMfile', '-in', type=str, required=True, help="BAM file")
    parser.add_argument('--outfile', '-out', type=str, required=True, help = 'Name for output file')

    args=parser.parse_args()
    return args

def remove_multiAlignments(ReadTagsEntry, KeepInfo):
    '''Checks for NH and XS tag to remove reads that aligned more than once'''

    keepRead = KeepInfo

    if 'XS' in ReadTagsEntry:
        keepRead=False
    elif 'NH' in ReadTagsEntry and ReadTagsEntry[1] > 1:
        keepRead=False

    return(keepRead)

def main():
    args = get_args()

    infile = pysam.Samfile(args.BAMfile, "rb")
    out = pysam.Samfile(args.outfile , "wb", template = infile )

    keep = True

    for DNAread in infile.fetch():
        intags = DNAread.tags

         for entry in intags:
            keep = remove_multiAlignments(entry, keep)

        if keep:
            out.write(DNAread)

        keep = True

    out.close()

if __name__ == '__main__':
main()
ADD COMMENTlink
0
Entering edit mode
14 months ago
5heikki 8.4k
Finland
I don't remember the exact flags but basically samtools view -bS yourFile | awk '$X != ""' where x corresponds to the column that shouldn't be empty..
ADD COMMENTlink
0
Entering edit mode
3.9 years ago
abl0719 • 0

The filter function from Bamtools (https://github.com/pezmaster31/bamtools) may be a better choice here. Since the bam file is no decompressed, it should be faster than grep.

ADD COMMENTlink
0
Entering edit mode

Good idea, but actually, on my computer, bamtools is about 3 times slower than grep:

$ time (bamtools filter -in example.bam -tag NM:1 > /dev/null )

real    0m36.750s
user    0m36.512s
sys 0m0.216s

$ time (samtools view -h example.bam | grep '^@|NM:i:0' > /dev/null)

real    0m12.197s
user    0m14.432s
sys 0m0.788s

(I ran the commands multiple times, so the BAM file is already in the cache memories of whichever component of the computer its operating system.)

I am using the bamtools binary from Debian 8 (Jessie).

$ bamtools --version

bamtools 2.3.0
Part of BamTools API and toolkit
Primary authors: Derek Barnett, Erik Garrison, Michael Stromberg
(c) 2009-2012 Marth Lab, Biology Dept., Boston College

$ dpkg -l bamtools
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                      Version           Architecture      Description
+++-=========================-=================-=================-========================================================
ii  bamtools                  2.3.0+dfsg-2      amd64             toolkit for manipulating BAM (genome alignment) files
ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3