Annotation of duplicate reads
9.7 years ago
ifreecell ▴ 220

Hi there,

I have a lot of duplicate reads in my RNA-seq data, which is great from my point of view. My question is: is there any way to get the coordinates of these duplicate reads (> 10 identical reads) from a BAM/SAM file? It would be perfect to be able to annotate these regions with overlapping or flanking genes and output the results in BED format. Any suggestions are appreciated, but solutions using R packages are preferred.

Here are some artificial example datasets (they might be deleted after one year) for testing.

Tags: bed • RNA-Seq • BAM • annotation • duplicate-reads

If you are only interested in retrieving a list of genes that have lots of duplicate reads, then just compare the read-count files generated from the BAM file before and after removal (or flagging) of duplicate reads. Read-counting software such as HTSeq counts the reads overlapping each gene, so you can compare the two count files and extract the genes whose counts drop by more than 10 between them. Retrieving the intergenic regions that show duplication is not as straightforward, but it can be accomplished using bedtools.
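
A minimal sketch of that comparison in base R, assuming htseq-count style two-column output (gene ID, then count); both file names here are placeholders:

```r
## Compare counts generated before and after duplicate removal;
## the file names below are placeholders for your own count files.
before <- read.table("counts_with_dups.txt", col.names = c("gene", "count"),
                     stringsAsFactors = FALSE)
after  <- read.table("counts_dups_removed.txt", col.names = c("gene", "count"),
                     stringsAsFactors = FALSE)

## htseq-count appends summary rows such as "__no_feature"; drop them.
before <- before[!grepl("^__", before$gene), ]
after  <- after[!grepl("^__", after$gene), ]

## Keep genes that lose more than 10 reads once duplicates are removed.
merged      <- merge(before, after, by = "gene", suffixes = c(".before", ".after"))
merged$lost <- merged$count.before - merged$count.after
merged[merged$lost > 10, ]
```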

What have you tried? It'd be possible to do this in any language with convenient access to BAM files (C, Python, R, Java, etc.).

Could you give me some clues on how to do that?

The general structure would be to read a coordinate-sorted BAM file and keep a buffer, in a language of your choosing, holding one alignment. You then iterate through the alignments, checking whether each one matches the alignment in the buffer. If it does, increment a counter; if it doesn't, check whether the counter is large enough to record the location, then replace the buffer contents with the current alignment, reset the counter, and continue. This is easy enough with single-end reads, but it can be a bit more complicated with paired-end reads, depending on how you want to define duplicates.
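
Since R solutions were requested, here is a minimal sketch of this idea using Bioconductor. To be explicit about the assumptions: the file name example.bam, the > 10 threshold, and the TxDb.Hsapiens.UCSC.hg19.knownGene annotation are stand-ins to adapt to your data, and a duplicate is defined here as a single-end read sharing chromosome, start, end, and strand with another. Instead of a streaming buffer, table() does the counting, at the cost of holding all alignments in memory at once.

```r
## A sketch under the assumptions above, not a polished implementation.
library(GenomicAlignments)                  # readGAlignments()
library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # assumed organism/annotation
library(rtracklayer)                        # export() to BED

ga <- readGAlignments("example.bam")        # placeholder file name
gr <- granges(ga)                           # one range per aligned read

## Count reads with identical chromosome, start, end and strand.
key    <- paste(as.character(seqnames(gr)), start(gr), end(gr),
                as.character(strand(gr)))
counts <- table(key)

## One representative range per position with > 10 identical reads.
dup_gr  <- unique(gr)
dup_key <- paste(as.character(seqnames(dup_gr)), start(dup_gr), end(dup_gr),
                 as.character(strand(dup_gr)))
keep    <- as.vector(counts[dup_key] > 10)
dup_gr  <- dup_gr[keep]
dup_n   <- as.integer(counts[dup_key][keep])

## Annotate with overlapping genes (one gene kept per region), falling
## back to the nearest (flanking) gene where nothing overlaps.
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
hits  <- findOverlaps(dup_gr, genes)
dup_gr$gene_id <- NA_character_
dup_gr$gene_id[queryHits(hits)] <- genes$gene_id[subjectHits(hits)]
no_hit <- is.na(dup_gr$gene_id)
dup_gr$gene_id[no_hit] <- genes$gene_id[nearest(dup_gr[no_hit], genes)]

## Export as BED, with the gene ID as the name and the duplicate count
## as the score.
out       <- granges(dup_gr)                # strip metadata columns
out$name  <- ifelse(is.na(dup_gr$gene_id), ".", dup_gr$gene_id)
out$score <- dup_n
export(out, "duplicated_regions.bed", format = "bed")
```

For paired-end data the key would also need to incorporate the mate's position, which is exactly where the definition of a duplicate starts to matter.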
