Probesets In Introns
1
0
Entering edit mode
12.2 years ago
polarise ▴ 380

Hi All,

I need to get the probesets for HuEx_1.0stv2 that map to hg18 introns. Specifically, I only want regions that are strictly intronic (not containing exonic bases of other transcripts for each gene). I know of exonmap and X:map. Exonmap requires that I install Ensembl (unimaginarily huge task) and X:map does not have hg18.

Any clues?

Regards,

Paul K

intron hg ensembl • 2.3k views
ADD COMMENT
0
Entering edit mode

Assuming you mean introns defined by UCSC HG18 genes.

ADD REPLY
3
Entering edit mode
12.2 years ago
Neilfws 49k

I'm sure that there is a clever way to do this using MySQL queries to the UCSC genome database tables, probably in conjunction with Galaxy. Here's my less-clever way:

  1. Follow the instructions in this answer to obtain a BED file with the introns; we'll call it introns.txt
  2. Go to the Affymetrix page for exon arrays and login (create account if required)
  3. Scroll down to "Archived NetAffx Annotation Files" and download the appropriate file for HG18; it's named HuEx-1_0-st-v2.na29.hg18.transcript.csv.zip

Next: unzip the exon annotation file and extract just the relevant fields:

unzip HuEx-1_0-st-v2.na29.hg18.transcript.csv.zip
grep -v "^#" HuEx-1_0-st-v2.na29.hg18.probeset.csv | 
awk 'BEGIN {FS=","; OFS=","} {print $1,$2,$3,$4,$5,$6,$7}' > huex.csv
head -3 huex.csv # first couple of lines

"probeset_id","seqname","strand","start","stop","probe_count","transcript_cluster_id"
"2315101","chr1","+","1788","2030","4","2315100"
"2315102","chr1","+","2520","2555","4","2315100"

Now, read those files into R and find overlaps using the GenomicRanges package.

library(GenomicRanges)
introns <- read.table("introns.txt", header = F, stringsAsFactors = F)
colnames(introns) <- c("chr", "start", "end", "name", "score", "strand")
huex <- read.table("huex.csv", sep = ",", header = T, stringsAsFactors = F)
# make GRanges objects
introns.gr <- GRanges(seqnames = Rle(introns$chr),
                      ranges = IRanges(start = introns$start,
                                       end = introns$end,
                                       names = introns$name),
                      strand = Rle(introns$strand))

huex.gr <- GRanges(seqnames = Rle(huex$seqname),
                   ranges = IRanges(start = huex$start,
                                    end = huex$stop,
                                    names = huex$probeset_id),
                                    strand = Rle(huex$strand))

# find probesets wholly within introns
ov <- findOverlapshuex.gr, introns.gr, type = "within")

The matchMatrix() method returns the corresponding rows for query (probesets) and subject (introns). Here's what I got back for chromosome 21:

head(matchMatrix(ov))
      query subject
# [1,] 802861     140
# [2,] 802866     144
# [3,] 802867     144
# [4,] 802870     142
# [5,] 802870     145
# [6,] 802871     142

And you can confirm that selected probesets are within introns:

c(introns[140,]$start, introns[140,]$end)
# [1] 13332616 13336693
c(huex[802861,]$start, huex[802861,]$stop)
# [1] 13332617 13332722
ADD COMMENT
0
Entering edit mode

That worked quite well. Just some observations made:

  • the CSV file (huex.csv) has '---' in some positions for strand and this raises an error with GenomicRanges; I removed them using $sed -i 's/---/*/g' huex.csv
  • requires GenomicRanges v.1.3 or greater for the findOverlaps(query,subject,type="within") to work (see https://stat.ethz.ch/pipermail/bioconductor/2011-January/037460.html for the actual bug report)
  • to get the actual probesets requires:

    probeset_indices <- unique(queryHits(ov))
    probesets <- huex[probeset_indices,]
    write.table(probesets,file="probesets_in_introns.txt",col.names=T,row.names=F,quote=F,sep="\t")
    
ADD REPLY

Login before adding your answer.

Traffic: 2385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6