Question

Identify reads that span intron-exon junctions in RNA-Seq

5.8 years ago

oriolebaltimore ▴ 190

Dear group, given RNA-Seq data for one or more samples/conditions, I am interested in finding:

A. Exon-intron or intron-exon junctions with reads spanning them.
B. A related case is intron retention, where reads map across exon-intron-exon.

A differs from B in that, in A, coverage ends partway into the intron. Tools such as iREAD already identify intron retention as in B.

Is there a way, using samtools or another tool, to find reads that overhang an intron-exon or exon-intron junction?

An example image is given below for clarity. Thanks for your help.

-A

Image: intron-exon read overhang in the top sample; the samples below show no such overhang or coverage.

RNA-Seq • 5.7k views

Updated 5.1 years ago by Malcolm.Cook ★ 1.5k • written 5.8 years ago by oriolebaltimore ▴ 190

Did you look at "Extracting Intron-Exon Reads from bam files"?

5.8 years ago by Pierre Lindenbaum 161k

And what about your previous questions, e.g. ~~Obtaining exon-intron and intron-exon reads~~ "Mapping and annotating DNA binding regions from ChIP-Seq to nearby gene"? Did you leave a comment or mark them as "accepted"?

5.8 years ago by Pierre Lindenbaum 161k

Hi Pierre, thanks for the reply. I will check this question: Extracting Intron-Exon Reads from bam files

As for your second question, I did not post that one, but it is helpful.

Thanks!

5.8 years ago by oriolebaltimore ▴ 190

Dear Alex, thank you so much. It worked beautifully! Adrian

5.6 years ago by oriolebaltimore ▴ 190

IMHO, identifying intron retention and estimating its abundance from exon-intron reads is unreliable at best and often leads to overestimation. Besides the fact that introns can have one or both splice sites alternatively spliced, such an approach is sensitive to alignment parameters such as overhang and mismatches. I reckon a better way is to construct an intron database from both the reference annotation and any novel events identified from sequencing, and count/quantify reads in the BAM file against it.

5.1 years ago by Eric Lim ★ 2.1k

5.1 years ago

Malcolm.Cook ★ 1.5k

I have used the featureCounts function from the Rsubread package for this express purpose and found it worked perfectly. I recommend it if you work in R because:

it gives you precise control over the extent of overlap you want to require for a read to be counted as overlapping your intron-exon junction
it interoperates with spliced alignment BAM files as produced most excellently and quickly by the STAR aligner
you can profitably pass it acceptor and/or donor coordinates using the SAF format (if you do, you will probably want to replace GENEID with JUNCTIONID)
it is quite fast, and supports multi-threading
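
The SAF suggestion above could be sketched roughly as follows. SAF is a tab-delimited, 1-based table with the columns GeneID, Chr, Start, End and Strand. This is only a minimal sketch: the file names (junctions.bed, junctions.saf), the junction IDs and the coordinates are made up for illustration, and the input is assumed to be 6-column BED (0-based start, half-open end).

```shell
# Hypothetical input: two junction windows in BED6 (0-based start, half-open end).
printf 'chr1\t975\t1025\tJ1_donor\t0\t+\n'     >  junctions.bed
printf 'chr1\t1975\t2025\tJ1_acceptor\t0\t+\n' >> junctions.bed

# SAF is 1-based and inclusive, so add 1 to the BED start and keep the BED end.
# The junction ID goes into the first (GeneID) column.
awk 'BEGIN { FS = OFS = "\t"; print "GeneID", "Chr", "Start", "End", "Strand" }
     { print $4, $1, $2 + 1, $3, $6 }' junctions.bed > junctions.saf

cat junctions.saf
```

In R, such a file could then be handed to featureCounts through its annot.ext argument with isGTFAnnotationFile = FALSE. The minOverlap argument sets how many bases a read must share with a junction window to be counted.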

5.0 years ago by Malcolm.Cook ★ 1.5k

score 3 · Accepted Answer · 2018-07-10

Steps via awk and BEDOPS.

Build a BED file of exons:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz \
    | gunzip -c - \
    | awk '($3=="exon")' - \
    | gtf2bed - \
    | cut -f1-6 - \
    > gencode.v28.annotation.exons.c1t6.bed

Convert transcripts to merged exons:

$ awk -f transcripts2mergedExons.awk gencode.v28.annotation.exons.c1t6.bed > gencode.v28.annotation.mergedExons.c1t6.bed

Make an exon-intron list:

$ awk -f mergedExons2exonIntronList.awk gencode.v28.annotation.mergedExons.c1t6.bed > gencode.v28.annotation.exonsAndIntrons.c1t6.bed

Convert these to junctions:

$ awk -f exonIntronList2JunctionList.awk gencode.v28.annotation.exonsAndIntrons.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.c1t6.bed

Optionally, pad them, e.g. by 25 nt around the junction (adjust as needed):

$ bedops --everything --range 25 gencode.v28.annotation.exonsIntronJunctions.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed

Map reads to the padded junctions:

$ bedmap --echo --count --delim '\t' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain each junction followed by the number of reads that overlap the (optionally padded) junction by one or more bases.
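
A count file like this can be post-filtered to keep only junctions with read support. The following is a minimal sketch, assuming the junction file has six BED columns so that the count from --count lands in the seventh and last column. The example file name, junction names, counts and threshold are all hypothetical.

```shell
# Hypothetical excerpt of the count output: BED6 columns plus a read count.
printf 'chr1\t975\t1025\tJ1\t0\t+\t12\n' >  answer.example.bed
printf 'chr1\t1975\t2025\tJ2\t0\t+\t0\n' >> answer.example.bed
printf 'chr2\t475\t525\tJ3\t0\t-\t3\n'   >> answer.example.bed

# Arbitrary threshold, chosen only for illustration.
MIN_READS=3

# Keep junctions whose last column (the read count) reaches the threshold,
# then list the best-supported junctions first.
awk -v min="$MIN_READS" 'BEGIN { FS = OFS = "\t" } $NF >= min' answer.example.bed \
    | sort -t "$(printf '\t')" -k7,7nr > covered_junctions.bed

cat covered_junctions.bed
```

Comparing such filtered lists between samples is one way to spot junctions with overhanging reads in one condition but not in the others.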

If you want the actual reads that map to the exon-intron junctions:

$ bedmap --echo-map --skip-unmapped --multidelim '\n' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) | sort-bed - > answer.bed

In this second example, the file answer.bed will contain the reads that map to the junction, by one or more bases.
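
Overlap by a single base is a loose criterion. If you want reads that genuinely overhang the boundary, one option is to require a minimum number of bases on each side. The following is a minimal sketch, not part of the steps above: it assumes reads are already in BED coordinates as unspliced segments on a single chromosome, and the file names, boundary position and overhang are hypothetical.

```shell
# Hypothetical reads in BED coordinates (0-based start, half-open end).
printf 'chr1\t950\t1050\tread_spanning\n'   >  reads.example.bed
printf 'chr1\t995\t1095\tread_short_left\n' >> reads.example.bed
printf 'chr1\t900\t1000\tread_exon_only\n'  >> reads.example.bed

# Exon-intron boundary: the exon ends at 1000 and the intron starts at 1000.
BOUNDARY=1000
# Minimum number of bases required on each side (illustrative value).
MIN_OVERHANG=8

# Keep reads that cover at least MIN_OVERHANG bases on both sides of the boundary.
awk -v b="$BOUNDARY" -v k="$MIN_OVERHANG" 'BEGIN { FS = OFS = "\t" }
     $2 <= b - k && $3 >= b + k' reads.example.bed > spanning_reads.bed

cat spanning_reads.bed
```

Here only read_spanning passes: read_short_left has just 5 bases on the exon side, and read_exon_only stops at the boundary.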

Links to the GitHub Gists with the awk scripts used above:

(transcripts2mergedExons.awk)
(mergedExons2exonIntronList.awk)
(exonIntronList2JunctionList.awk)