Biostar Beta. Not for public use.
Bam : Extract Only Alignment With A Specific Length
1
Entering edit mode
13 months ago
Belgium, Brussels

Hi,

Is it possible to extract from a BAM file, only alignments with a specifi lentgh (ex: length > 60 ) ?

Thanks,

N.

ADD COMMENTlink
7
Entering edit mode
6.5 years ago
Marvin • 850

Assuming "length of alignment" is measured on the reference, this will do it for SAM:

perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60'

Combine with samtools as needed, e.g:

samtools view -h foo.bam | perl -lane '$l = 0; $F[5] =~ s/(\d+)[MX=DN]/$l+=$1/eg; print if $l > 60 or /^@/' | samtools view -bS - > bar.bam
ADD COMMENTlink
5
Entering edit mode
16 months ago
France/Nantes/Institut du Thorax - INSE…

Here is a solution using the picard API:

import java.io.File;

import net.sf.samtools.SAMFileReader;
import net.sf.samtools.SAMFileWriter;
import net.sf.samtools.SAMFileWriterFactory;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMRecordIterator;
import net.sf.samtools.SAMFileReader.ValidationStringency;

public class Biostar12433 {

    public static void main(String[] args) throws Exception
        {
        if(args.length!=1) return;
        final int LENGTH=60;
        SAMFileReader inputSam=new SAMFileReader(new File(args[0]));
        inputSam.setValidationStringency(ValidationStringency.SILENT);
        SAMRecordIterator iter=inputSam.iterator();
        SAMFileWriterFactory sf=new SAMFileWriterFactory();
        sf.setCreateIndex(false);
        SAMFileWriter sw=sf.makeSAMWriter(inputSam.getFileHeader(),false,System.out);

        while(iter.hasNext())
            {
            SAMRecord rec=iter.next();
            if(rec.getReadUnmappedFlag()) continue;
            int length=rec.getAlignmentEnd()-rec.getAlignmentStart();
            if(length<LENGTH)
                {
                continue;
                }
            sw.addAlignment(rec);
            }
        iter.close();
        inputSam.close();
        sw.close();
        }

}

Compilation:

javac -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433.java

Execution:

 java -cp path/to/sam-1.35.jar:path/to/picard-tools-1.35/picard-1.35.jar Biostar12433 file.bam > result.sam

Here I've used rec.getAlignmentEnd()-rec.getAlignmentStart(); but you could also use getUnclippedStart/getUnclippedEnd. See http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html

ADD COMMENTlink
0
Entering edit mode

nice! I am still in the world of parsing out the CIGAR strings, need to change that

ADD REPLYlink
0
Entering edit mode

6.5 years later: use samjdk : http://lindenb.github.io/jvarkit/SamJdk.html . Something like:

java -jar dist/samjdk.jar -e 'return record.getReadUnmappedFlag() || rec.getEnd()-rec.getStart()>=60;' in.bam
ADD REPLYlink
5
Entering edit mode
5.6 years ago
Sven Bilke • 70
Bethesda, MD, USA

Here is yet another version, in C, for a change. It implements different length functions and it should be fairly easy to accommodate to whatever your needs are.

#include <stdio.h>
#include <samtools/sam.h>


int len_query(const bam1_t *b) {
  // This one includes soft-clipped nucleotides
  return  bam_cigar2qlen(&b->core, bam1_cigar(b));
}

int len_genome(const bam1_t *b) {
  // length in genome-cooredinates
  return  bam_calend(&b->core, bam1_cigar(b)) - b->core.pos;
}

int len_align(const bam1_t *b)  {
  // pedestrian way, fill in whatever choice you like, currently caculates number of (non-insert, non-delete) 
  // aligned (match & missmatch) nt's of the query
  int i;
  uint32_t *cigar = bam1_cigar(b);
  int l  = 0;
  for(i=0; i < b->core.n_cigar; ++i) {
    const int opl  = cigar[i] >> BAM_CIGAR_SHIFT;
    const int op   = cigar[i] & BAM_CIGAR_MASK;
    if(op == BAM_CMATCH)   l+= opl;
  }
  return l;
}


int main(int argc, char **argv) {
  int (*len_func)(const bam1_t *);
  if(argc < 4) {
    fprintf(stderr, "Insuficient arguments, Usage: %s len mode in.bam out.bam\n", argv[0]);
    exit(-1);
  } 
  switch(*argv[2]) {
  case 'Q': 
    len_func = len_query;
    break;

  case 'A': 
    len_func = len_align;
    break;

  case 'G': 
    len_func = len_genome;
    break;

  default:
    fprintf(stderr, "unsupported mode, known are _Q_uery-length, _A_ligned length and _G_enomic length\n");
    exit(-1);
    break;
  }

  int len=atoi(argv[1]);

  samfile_t *I = samopen(argv[3], "rb", 0);
  if(!I) {
    printf("Can not open sam input %s\n", argv[3]);
    exit(-1);
  }
  samfile_t *O = samopen(argv[4], "wb", I->header);
  if(!O) {
    printf("Can not open sam output %s\n", argv[4]);
    exit(-1);
  }

  bam1_t *b = bam_init1();

  while (samread(I, b) >= 0) {
    int l = len_func(b);
    if(len != len_func(b)) continue;
    samwrite(O, b);
  }
  samclose(I);
  samclose(O);
  bam_destroy1(b);
  exit(0);
}
ADD COMMENTlink
0
Entering edit mode

--Hi,

nice usefull code, is there a way to add filters on chromosome and range such as: %s len mode chr range in.bam out.bam

thx

ADD REPLYlink
3
Entering edit mode
2.2 years ago
Ido Tamir 5.0k
Austria

Same as Pierres in Scala.

save as filter.scala

start with:

sh filter.scala in.bam out.bam 60

Because the problem is not really complex, it does not show off scalas benefits, apart from being able to call it as a script.

#!/bin/sh
cp='sam-1.50.jar;picard-1.50.jar'
exec scala -classpath $cp "$0" "$@"
!#

import scala.collection.JavaConverters._
import net.sf.samtools._
import net.sf.picard.filter._
import net.sf.picard.io.IoUtil
import net.sf.samtools.util.CloserUtil
import net.sf.samtools.SAMFileReader.ValidationStringency
import java.io.File

class SamFilteredCopy(inName: String, outName: String, filterOut: SAMRecord => Boolean) {
     val input = new SAMFileReader(new File(inName))
     val output = if(outName == "stdout"){
         new SAMFileWriterFactory().makeSAMWriter(input.getFileHeader, false, System.out)
     }else{
         val outFile = new java.io.File(outName)
         IoUtil.assertFileIsWritable(outFile)
         new SAMFileWriterFactory().makeSAMOrBAMWriter(input.getFileHeader, false, outFile)
     }
     input.setValidationStringency(ValidationStringency.SILENT)
     for(record <- input.iterator.asScala if filterOut(record)){
         output.addAlignment(record)            
     }  
     CloserUtil.close(input)
     CloserUtil.close(output)
}

def filter(length: Int)(record: SAMRecord): Boolean = {
    record.getAlignmentEnd - record.getAlignmentStart + 1 > length 
}

if(args.length < 3){
    error("need three arguments: inName, outName (stdout), size")
}
new SamFilteredCopy(args(0),args(1),filter(args(2).toInt) _)
ADD COMMENTlink
0
Entering edit mode

Love the scala use in bioinformatics. Wish it was more prevalent!

ADD REPLYlink
1
Entering edit mode
2.9 years ago
Ketil 3.9k
Germany

I think this depends on the aligner you use. E.g. bwa will report (AFAICS) the full sequence, and use soft clipping at the ends, while gsMapper will use hard clipping, and report only the matched sequence.

I think you will have to parse the CIGAR field and count (sum up) the number of M's.

ADD COMMENTlink
0
Entering edit mode

I used bowtie with -S option and then samtools to convert sam to bam

ADD REPLYlink
0
Entering edit mode

You can't just count the M operators - at very least you must also include the X and = characters (mismatch and match), which are a more explicit alternative to M (match or mismatch). In fact you probably should count the I operations too.

It might be simpler to take the read's sequence and deduct any soft trimming (the number of S operations in the CIGAR string)?

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1