Question

Probesets In Introns

0

Entering edit mode

12.2 years ago

polarise ▴ 380

Hi All,

I need to get the probesets for HuEx_1.0stv2 that map to hg18 introns. Specifically, I only want regions that are strictly intronic (not containing exonic bases of other transcripts for each gene). I know of exonmap and X:map. Exonmap requires that I install Ensembl (unimaginarily huge task) and X:map does not have hg18.

Any clues?

Regards,

Paul K

intron hg ensembl • 2.3k views

ADD COMMENT • link updated 12.2 years ago by Neilfws 49k • written 12.2 years ago by polarise ▴ 380

0

Entering edit mode

Assuming you mean introns defined by UCSC HG18 genes.

ADD REPLY • link 12.2 years ago by Neilfws 49k

Ram · Answer 1 · 2012-02-17

I'm sure that there is a clever way to do this using MySQL queries to the UCSC genome database tables, probably in conjunction with Galaxy. Here's my less-clever way:

Follow the instructions in this answer to obtain a BED file with the introns; we'll call it introns.txt
Go to the Affymetrix page for exon arrays and login (create account if required)
Scroll down to "Archived NetAffx Annotation Files" and download the appropriate file for HG18; it's named HuEx-1_0-st-v2.na29.hg18.transcript.csv.zip

Next: unzip the exon annotation file and extract just the relevant fields:

unzip HuEx-1_0-st-v2.na29.hg18.transcript.csv.zip
grep -v "^#" HuEx-1_0-st-v2.na29.hg18.probeset.csv | 
awk 'BEGIN {FS=","; OFS=","} {print $1,$2,$3,$4,$5,$6,$7}' > huex.csv
head -3 huex.csv # first couple of lines

"probeset_id","seqname","strand","start","stop","probe_count","transcript_cluster_id"
"2315101","chr1","+","1788","2030","4","2315100"
"2315102","chr1","+","2520","2555","4","2315100"

Now, read those files into R and find overlaps using the GenomicRanges package.

library(GenomicRanges)
introns <- read.table("introns.txt", header = F, stringsAsFactors = F)
colnames(introns) <- c("chr", "start", "end", "name", "score", "strand")
huex <- read.table("huex.csv", sep = ",", header = T, stringsAsFactors = F)
# make GRanges objects
introns.gr <- GRanges(seqnames = Rle(introns$chr),
                      ranges = IRanges(start = introns$start,
                                       end = introns$end,
                                       names = introns$name),
                      strand = Rle(introns$strand))

huex.gr <- GRanges(seqnames = Rle(huex$seqname),
                   ranges = IRanges(start = huex$start,
                                    end = huex$stop,
                                    names = huex$probeset_id),
                                    strand = Rle(huex$strand))

# find probesets wholly within introns
ov <- findOverlapshuex.gr, introns.gr, type = "within")

The matchMatrix() method returns the corresponding rows for query (probesets) and subject (introns). Here's what I got back for chromosome 21:

head(matchMatrix(ov))
      query subject
# [1,] 802861     140
# [2,] 802866     144
# [3,] 802867     144
# [4,] 802870     142
# [5,] 802870     145
# [6,] 802871     142

And you can confirm that selected probesets are within introns:

c(introns[140,]$start, introns[140,]$end)
# [1] 13332616 13336693
c(huex[802861,]$start, huex[802861,]$stop)
# [1] 13332617 13332722