How To Count Stand-Specific Paired-End Rna-Seq Reads Overlapping Known Protein Coding Genes ?
1
0
Entering edit mode
10.3 years ago
biorepine ★ 1.5k

Dear Biostars

Does any one how to overlap stand-specific paired-end RNA-Seq reads (BAM) with known protein coding genes (BED) ?

I tried the following but I think it is not the correct way ? Would appreciate your help!

bamTobed -i ES.bam > ES.bed 
intersectBed -a ES.bed -b Ensembl_mm9.bed -wa -s |awk '!a[$4]++' |wc -l
paired-end rna-seq overlap • 3.6k views
ADD COMMENT
1
Entering edit mode

Why don't you just make your life easier and use featureCounts or htseq-counts? BTW, intersectBed can take a BAM file as input (use -abam instead of -a).

ADD REPLY
0
Entering edit mode

I think both packages that you mentioned take gff format but not BED.

ADD REPLY
0
Entering edit mode

Exactly, just download the GTF or GFF file for mm9 (or the Ensembl annotation, since it's unclear which you're using) instead of making a BED file out of things.

ADD REPLY
0
Entering edit mode

but i have my own BED files that custom made like novel transcripts.

ADD REPLY
1
Entering edit mode
10.3 years ago

If you must use a BED file and intersectBed:

 intersectBed -a ES.bed -b Ensembl_mm9.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'

Keep in mind that intersectBed shouldn't be used to count RNAseq reads, since it will increment the counter for a feature even if a read maps to not just it but another feature. As an example:

>cat foo.bed
chr1    0    100    read1    .    +
chr1    0    100    read2    .    -
chr1    50    150    read3    .    +
chr1    100    200    read4    .    +

>cat bar.bed
chr1    0    100    target1    .    +
chr1    0    100    target2    .    -
chr1    20    120    target3    .    +

>intersectBed -a foo.bed -b bar.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'
target1 2
target2 1
target3 3

Only target2 should have a count (or target1 should have 1 and target3 should have 2, depending on how you overlap things). You could script around this, but it's faster to just use a different tool.

Edit: I can recommend featureCounts (from Subread) and also htseq-count for this. Making a GTF should be relatively straight-forward. Just increment the second column, set the 4th as the gene_id and transcript_id and shuffle around the remainder (adding some "." columns).

ADD COMMENT
0
Entering edit mode

i just voted your feature count answer. May be include that in answer as well. it could be useful to others. I will try to see whether i can convert my BED into GTF and run the tools you suggested. Great thanx anyways!!

ADD REPLY
0
Entering edit mode

I've added that bit to the answer and provided some links. I've never needed to convert a BED file to GTF, but it should be relatively straight-forward (I mentioned briefly how to do it in my update). Keep in mind that BED files don't have any way of representing spliced features, so I hope you don't have anything that's spliced (if you do and each of the exons has the same label, then you can still deal with that, just use python or perl and sort the input).

ADD REPLY

Login before adding your answer.

Traffic: 1511 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6