pd.hugene.2.0.st : Get probes start and stop
2
2
Entering edit mode
6.8 years ago
cndl ▴ 60

Hello world !

I am working on microarray data (Hugene 2.0 st from affymetrix) and I was willing to get probe-level informations.

For one probeset (fsetid), I need all the different probes (fid) and their chromosome, start, stop, sequence etc.

I know I can get this on NetAffx but as I have to check multiple probesets, I wanted to use the library oligo in association with pd.hugene.2.0.st

Here is my code tested on a 'test-list' of 6 .CEL files:

## Library loading
library(oligo)
library(dplyr)
librarypd.hugene.2.0.st)

# .CEL importation and read
celFiles <- list.celfiles('./', full.names=TRUE)
batch = read.celfiles(celFiles)

# Expression matrix tmp
tmp=batch@assayData$exprs
dim(tmp) 

# Connexion pd.hugene.2.0.st@getdb() and normalization
conn = pd.hugene.2.0.st@getdb()
norm = rma(batch, background=TRUE,  normalize=TRUE, target="core") 

# Data importation
tableNames=dbListTables(conn)
tablesList=list()

for (tableName in tableNames){
  tablesList[[tableName]]=dbGetQuery(conn,paste('SELECT * FROM',tableName))
  print(tableName)
  str(tablesList[[tableName]])
}

# Simplification of the tablesList dataframe
testList <- tablesList$featureSet
testList2 <- tablesList$pmfeature
testListComplet <- dplyr::full_join(testList, testList2, by = "fsetid")

# Test on one probeset randomly selected : probeset 16730541
ind = testListComplet$fid[which(testListComplet$fsetid == 16730541)]
tmp[ind,]
probeset_test <- subset(testListComplet, fsetid == 16730541)
probeset_test

Here is the probeset_test result I get :

 fid        fsetid strand     start      stop transcript_cluster_id exon_id crosshyb_type level chrom type     fid   atom   x    y
153940 16730541      0 102218019 102218094              16730540 5033332             1    NA    11    1  944681 153940  48  586
153941 16730541      0 102218019 102218094              16730540 5033332             1    NA    11    1 2165196 153941 279 1343

So, according to pd.hugene.2.0.st, for probeset 16730541, I have 2 probes : 153940 & 153941 with a common start (102218019) and stop (102218094). But when I check on the NetAffx website ( https://www.affymetrix.com/analysis/netaffx/exon/wtgene_probe_set.affx?pk=712:16730541 ), here are the two probes I get :

atcagcggcgccgacaaggagatac chr11:102218019-102218043 (+)

cagcaaacacggaagctgcgcggct chr11:102218070-102218094 (+)

Did I do something wrong ? Did I misunderstand something ? Could you please enlighten me ?!

I am sorry if this is a stupid question but I could not figure it out by myself :(

Thank you so much in advance.

Here is my R info :

platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.0                         
year           2017                        
month          04                          
day            21                          
svn rev        72570                       
language       R                           
version.string R version 3.4.0 (2017-04-21)
nickname       You Stupid Darkness
pd.hugene.2.0.st affymetrix microarray probe • 3.2k views
ADD COMMENT
3
Entering edit mode
6.8 years ago

You are right (almost!). If you see further in the NetAffx page that you linked

Probe Set Location: The genomic location of the probe set for the genome assembly used at array design time (See Genome Source). These coordinates begin at the first base of the first probe sequence and end at the last probe of the probe set.

ADD COMMENT
0
Entering edit mode

Oh, I did not noticed that !! Thank you !!

But, for the purpose of using R code-based analysis, do you know how I could get the right start and stop corresponding to each probe ? (instead of checking on NetAffx every time)

When I have something like 2 probes per probeset, it is easy to proceed by NetAffx but sometimes, I have way more so it gets pretty overwhelming !

ADD REPLY
0
Entering edit mode

I haven't used this array. But it seems to me that the info you are looking for is not in pmfeature, but in pmSequence. Do you have pmSequence in your tablesList?

For some clues, see Manipulation of probe-level data at https://github.com/benilton/oligoOld/wiki/Getting-the-grips-with-the-oligo-Package

Also see https://www.bioconductor.org/packages/devel/bioc/vignettes/oligo/inst/doc/oug.pdf

ADD REPLY
2
Entering edit mode

After multiple fails, I finally got every piece of info I needed (with the help of the github protocol and the pdf you mentioned)

So here is the code with all info :

field <- c('fid', 'fsetid', 'level', 'type', 'x', 'y', 'chrom', 'start', 'stop')
pInfo <- getProbeInfo(batch, field=field, sortBy='fid', target='core')
library(Biostrings)
data(pmSequence, package=annotationpd.hugene.2.0.st))
idx <- match(pInfo[["fid"]], pmSequence[["fid"]])
pmSequence <- pmSequence[idx,]
pmSequence <- as.data.frame(pmSequence)
final_annotated_table <- dplyr::full_join(pmSequence, pInfo, by = "fid")

With this result obtained :

  fid                  sequence man_fsetid   fsetid  x y    start     stop chrom level type
1   7 ACCGCGACTCACTTGGGTAGGACGG   16859314 16859314  6 0 16435735 16435785 chr19  <NA> main
2  10 CTCGGCGTCCCTGGGTGCGACAGAC   16928098 16928098  9 0 24237294 24237359 chr22  <NA> main
3  11 AGTCCGGCTGTCCTGGGTCGGACCC   16666749 16666749 10 0 86064814 86064902  chr1  <NA> main
4  12 AACCCGGAGTGCCTGGGTCGGCTGT   16868661 16868661 11 0 10429844 10429870 chr19  <NA> main
5  13 TCACGACGTGACCTGGGTCGGATGT   16914042 16914042 12 0 43030058 43030082 chr20  <NA> main
6  14 GCGACGGACTCTGGGTCGAGACACA   16861465 16861465 13 0 37383732 37384085 chr19  <NA> main

Thank you very much for your help and patience !

Kind Regards

Claire

ADD REPLY

Login before adding your answer.

Traffic: 3003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6