Find common reads between two bam files
1
0
Entering edit mode
5.1 years ago
yp19 ▴ 70

Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm (https://linux.die.net/man/1/comm). Here is my code:

samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam

comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam

Comm spits out these two messages:

comm: file 2 is not in sorted order

comm: file 1 is not in sorted order

and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?

Thank you!

sequencing bam reads • 3.0k views
ADD COMMENT
2
Entering edit mode
5.1 years ago

the sort order of samtools sort is not the same as the linux sort / comm

https://github.com/samtools/samtools/blob/develop/bam_sort.c#L1840

samtools sort calls strnum_cmpwhich is not a simple comparaison of bytes :

static int strnum_cmp(const char *_a, const char *_b)
{
    const unsigned char *a = (const unsigned char*)_a, *b = (const unsigned char*)_b;
    const unsigned char *pa = a, *pb = b;
    while (*pa && *pb) {
        if (isdigit(*pa) && isdigit(*pb)) {
            while (*pa == '0') ++pa;
            while (*pb == '0') ++pb;
            while (isdigit(*pa) && isdigit(*pb) && *pa == *pb) ++pa, ++pb;
            if (isdigit(*pa) && isdigit(*pb)) {
                int i = 0;
                while (isdigit(pa[i]) && isdigit(pb[i])) ++i;
                return isdigit(pa[i])? 1 : isdigit(pb[i])? -1 : (int)*pa - (int)*pb;
            } else if (isdigit(*pa)) return 1;
            else if (isdigit(*pb)) return -1;
            else if (pa - a != pb - b) return pa - a < pb - b? 1 : -1;
        } else {
            if (*pa != *pb) return (int)*pa - (int)*pb;
            ++pa; ++pb;
        }
    }
    return *pa? 1 : *pb? -1 : 0;
}

you want:

samtools view unmapped_reads1.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads1_sorted.bam
samtools view unmapped_reads2.bam |  cut -f1 | LC_ALL=C sort | uniq >   unmapped_reads2_sorted.bam

LC_ALL=C comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam
ADD COMMENT
0
Entering edit mode

When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)

MN00234:32:000H2LGW3:1:11101:10000:11153
MN00234:32:000H2LGW3:1:11101:10000:15424
MN00234:32:000H2LGW3:1:11101:10000:17282
MN00234:32:000H2LGW3:1:11101:10000:18247
MN00234:32:000H2LGW3:1:11101:10000:6127
MN00234:32:000H2LGW3:1:11101:10000:6852
MN00234:32:000H2LGW3:1:11101:10000:7439
MN00234:32:000H2LGW3:1:11101:10001:11174
MN00234:32:000H2LGW3:1:11101:10001:16085
MN00234:32:000H2LGW3:1:11101:10001:18520
MN00234:32:000H2LGW3:1:11101:10002:12632
MN00234:32:000H2LGW3:1:11101:10002:13716
MN00234:32:000H2LGW3:1:11101:10002:14294
MN00234:32:000H2LGW3:1:11101:10002:14998
MN00234:32:000H2LGW3:1:11101:10002:16729
MN00234:32:000H2LGW3:1:11101:10002:17066
MN00234:32:000H2LGW3:1:11101:10002:6528
MN00234:32:000H2LGW3:1:11101:10002:7756
ADD REPLY
0
Entering edit mode

yes, that's what you want if you want to use 'comm'...

ADD REPLY
0
Entering edit mode

Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences

ADD REPLY
1
Entering edit mode

once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id

ADD REPLY
0
Entering edit mode

I will try that thank you!!!

ADD REPLY
0
Entering edit mode

define "common sequences"

ADD REPLY
0
Entering edit mode

my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).

So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.

ADD REPLY

Login before adding your answer.

Traffic: 3115 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6