How to deal with multiple Transcription Start Site in a gene?
5.2 years ago
@Poorya Parvizi22564
In order to measure the conservation (Phastcons score) of each gene, i am trying to look at -2000:2000 base pairs below and above from transcription start site (TSS), which will represent my promoter. But the problem is, there are more than one TSS for each gene. What should i do? which one should i choose?
Promoter
Phastcons Score
TSS
• 1.9k views
5.2 years ago
@Giovanni M Dall'Olio23
If you want to avoid confusion you can simply use the gene coordinates from the Known Gene UCSC track:
> library(Homo.sapiens)
> library(phastCons100way.UCSC.hg19)
> mypromoters = promoters(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), 2000, -2000)
> mypromoters$cons100way = scores(phastCons100way.UCSC.hg19, mypromoters) # takes a while
> mypromoters
GRanges object with 23056 ranges and 2 metadata columns:
seqnames ranges strand | gene_id cons100way
<Rle> <IRanges> <Rle> | <character> <numeric>
1 chr19 [ 58874015, 58876214] - | 1 0.10872727
10 chr8 [ 18246755, 18248954] + | 10 0.36422727
100 chr20 [ 43280177, 43282376] - | 100 0.04400000
1000 chr18 [ 25757246, 25759445] - | 1000 0.01645455
10000 chr1 [244006687, 244008886] - | 10000 0.03836364
... ... ... ... ... ... ...
9991 chr9 [115095745, 115097944] - | 9991 0.25700000
9992 chr21 [ 35734323, 35736522] + | 9992 0.29409091
9993 chr22 [ 19109768, 19111967] - | 9993 0.01236364
9994 chr6 [ 90537619, 90539818] + | 9994 0.07536364
9997 chr22 [ 50964706, 50966905] - | 9997 0.15281818
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Login before adding your answer.
Hi again The code below is not working for me, mypromoters$cons100way = scores(phastCons100way.UCSC.hg19, mypromoters) what should i do? another question is, instead of gene_id i put tx_name column. but can i convert these UCSC ids to ensmble gene ID. actually i convert it by getBM package of R, but it is not seems good.what do you think?