Filter reads with most frequent mapping coordinates from BED
1
0
Entering edit mode
5.4 years ago

My bed file is like:

head -n3 test.bed
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1101:2559:2541    60  +
    Contig_68   2   53  HISEQ:244:C9PLYANXX:6:1102:15469:35000  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15983:77622  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15114:79444  60  +

For each contig, I want to keep only the reads corresponding to the most frequent similar mapping coordinates. Therefore, here, I want to remove the 2nd read. I have written the code below, which does exactly what I want, but it is extremely slow when dealing with millions of reads. I was wondering if I it could be rewritten to make it faster. In a downstream step I will filter out in a SAM file reads which are not in "to_keep". The overall purpose is to remove possible paralogous sequences from ddRADSeq reads.

awk '{print $1}' test.bed | uniq | while read bed
#filter bed by contig xth
contig=$(awk -v pat=$bed '$1==pat' test.bed)
#add a 7th column with cat of start and end of mapping positions
contig=$(paste <(echo "$contig") <(echo "$(echo "$contig" | awk '{print $2$3}')"))
#retrieve the most abundant coordinates
t=$(echo "$contig" | awk '{print $7}'| sort |uniq -c | sort -r | head -1 | xargs | cut -d' ' -f2)
#filter read names with to the most frequent coordinates
echo "$contig" | awk -v pat=$t '$7==pat''{print $4}' >>to_keep
done
next-gen bed radseq assembly • 978 views
ADD COMMENT
2
Entering edit mode
5.4 years ago
ATpoint 81k

My approach first identifies those fragments/reads per contig that are most abundant. In the first sort, also use tee to write the sorted input file to disk, will be needed in the second step:

sort -k1,1 -k2,2n -k3,3n test.bed | tee test_sorted.bed | cut -f1-3 | uniq -c | sed "s/^[ \t]*//" | tr " " "\t" | sort -nrk1,1 -k2,2n | cut -f2-4 | awk -F "\t" '!_[$1]++' | sort -k1,1 -k2,2n > toKeep.bed

Then, use bedtools intersect to intersect the sorted input file (saved with tee, see above), and the toKeep.bed, keeping only those reads from test.bed that are in toKeep.bed:

bedtools intersect -a test_sorted.bed -b toKeep.bed -sorted -f 1.0 -F 1.0 > answer.bed

As I do not really have a realistic dataset to test this, I am not sure how fast and efficient it is. Please give feedback how it performs.

ADD COMMENT
1
Entering edit mode

Did it work efficiently?

ADD REPLY
0
Entering edit mode

Thanks a lot for your input. It actually did work perfectly! I integrated it into my pipeline to finally filter the corresponding reads from my original BAM file:

bedtools intersect -a sorted.bed -b to_keep.bed -sorted -f 1.0 -F 1.0 | sort -k1.16,1n | awk '{print $4}' > reads
java -jar picard.jar FilterSamReads I=to_be_filtered.bam O=filt.bam READ_LIST_FILE=reads FILTER=includeReadList
ADD REPLY

Login before adding your answer.

Traffic: 1622 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6