Question

HeLa: Transcript Expression Database?

3

12.5 years ago

Alastair Kerr 5.3k

I wish to determine the most abundant splice variant in HeLa cells for each gene.

A quick search of GEO shows that there are RNA-seq experiments I could mine for this information, but before I reinvent the wheel, I am wondering if anyone is aware of a repository with pre-calculated values on a per-transcript basis (e.g. BioGPS does not appear to have this information).

splicing database rna • 6.4k views

updated 12.5 years ago by Larry_Parnell 16k • written 12.5 years ago by Alastair Kerr 5.3k

0

Useful tips here on how to do the analysis, but it seems that this has not been pre-calculated. I might put this forward for an MSc project.

12.5 years ago by Alastair Kerr 5.3k

score 3 · Answer 1 · 2011-11-15

Assuming you mean the most abundant known splice variant of each gene, you could do the following. Download an existing RNA-seq data set for HeLa from GEO, such as GSM759888. Since a BAM is provided, you can skip the alignment step and go straight to estimating isoform expression levels.

If you want to do your own alignments (say, to take advantage of a new aligner version), convert the BAM back to FASTQ files using Picard SamToFastq. Then run TopHat to get new alignments.
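The optional re-alignment step can be sketched as two command lines, built here in Python so they can be inspected before running. This is a minimal sketch under several assumptions: the reads are paired-end (not confirmed for GSM759888), Picard is available as a single `picard.jar`, and all file names, the Bowtie index prefix, and the helper function names are hypothetical placeholders.

```python
from typing import List


def samtofastq_cmd(bam: str, fq1: str, fq2: str,
                   picard_jar: str = "picard.jar") -> List[str]:
    """Build a Picard SamToFastq command that turns a BAM back into FASTQ.

    Assumes paired-end reads; for single-end data, drop SECOND_END_FASTQ.
    """
    return [
        "java", "-jar", picard_jar, "SamToFastq",
        f"INPUT={bam}",
        f"FASTQ={fq1}",
        f"SECOND_END_FASTQ={fq2}",
    ]


def tophat_cmd(bowtie_index: str, fq1: str, fq2: str,
               out_dir: str = "tophat_out") -> List[str]:
    """Build a TopHat command; alignments land in <out_dir>/accepted_hits.bam."""
    return ["tophat", "-o", out_dir, bowtie_index, fq1, fq2]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(" ".join(samtofastq_cmd("hela.bam", "hela_1.fastq", "hela_2.fastq")))
    print(" ".join(tophat_cmd("hg19", "hela_1.fastq", "hela_2.fastq")))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` once the paths have been checked.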

Next, get a GTF file representing all transcripts of all genes for human genome build hg19 (GRCh37). For example, Ensembl provides one covering all Ensembl transcripts: Homo_sapiens.GRCh37.64.gtf.gz

Use this GTF to run Cufflinks with the -G option. Once it finishes you should have an isoforms.fpkm_tracking file. This contains the FPKM expression estimate for each transcript (each of which should also be marked with a gene ID). With that data in hand, it is straightforward to identify the most highly expressed transcript for each gene that has multiple known transcripts.
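As a rough illustration of that last step, the sketch below picks the highest-FPKM transcript per gene from the tab-delimited isoforms.fpkm_tracking file that Cufflinks writes. It relies on the `tracking_id`, `gene_id`, and `FPKM` columns of that file; the function name and the example path are hypothetical, and ties are broken arbitrarily (by transcript ID), which may not be what you want.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def top_isoform_per_gene(lines: Iterable[str],
                         multi_only: bool = True) -> Dict[str, Tuple[str, float]]:
    """Return {gene_id: (transcript_id, FPKM)} for the most abundant isoform.

    `lines` is any iterable of text lines with a tab-delimited header row,
    e.g. an open isoforms.fpkm_tracking file.
    With multi_only=True, genes with a single transcript are skipped.
    """
    by_gene = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        by_gene[row["gene_id"]].append((float(row["FPKM"]), row["tracking_id"]))

    best = {}
    for gene, isoforms in by_gene.items():
        if multi_only and len(isoforms) < 2:
            continue
        fpkm, transcript = max(isoforms)  # highest FPKM; ties fall back to ID order
        best[gene] = (transcript, fpkm)
    return best


if __name__ == "__main__":
    import os

    # Hypothetical path: adjust to your Cufflinks output directory.
    path = "cufflinks_out/isoforms.fpkm_tracking"
    if os.path.exists(path):
        with open(path, newline="") as handle:
            for gene, (transcript, fpkm) in sorted(top_isoform_per_gene(handle).items()):
                print(gene, transcript, fpkm, sep="\t")
```

Low-FPKM genes deserve caution: when every isoform of a gene is near zero, the "top" transcript is not a meaningful call, so a minimum FPKM filter may be worth adding.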

If you do not want to be limited to the known or Ensembl-predicted alternative transcripts of each gene, the problem becomes more complicated. You could attempt to merge transcript GTFs from multiple sources to get a more comprehensive representation of existing transcript annotations. Or you could run Cufflinks in de novo mode, then figure out which assembled transcripts correspond to the same loci (using Cuffcompare, perhaps), and select the one with the highest abundance for each locus.

score 1 · Answer 2 · 2011-11-15

Alastair,

It may also be useful to look at EST data made from cDNA libraries taken from HeLa cells. Overall, this is not as informative as RNA-Seq but may still be able to distinguish relative abundance. You'll look for the number of EST entries corresponding to different splice variants (i.e., spanning a diagnostic splice junction).
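The counting itself is simple once ESTs have been assigned to junctions. The sketch below assumes you already have (EST ID, junction label) pairs from some upstream alignment step; the function name, the pair format, and the junction labels are all hypothetical, and each EST is counted once per junction.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def est_support(hits: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct ESTs supporting each diagnostic splice junction.

    `hits` yields (est_id, junction_label) pairs; duplicate pairs are
    collapsed so a single EST is never counted twice for one junction.
    """
    support = defaultdict(set)
    for est_id, junction in hits:
        support[junction].add(est_id)
    return {junction: len(ests) for junction, ests in support.items()}


if __name__ == "__main__":
    # Made-up identifiers, purely for illustration.
    example = [
        ("EST1", "exon2-exon3"),
        ("EST2", "exon2-exon3"),
        ("EST1", "exon2-exon3"),  # duplicate hit, ignored
        ("EST3", "exon2-exon4"),
    ]
    print(est_support(example))
```

EST counts from a single library are usually small, so differences of a few entries between variants should be read as suggestive rather than quantitative.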

ESTs will, in theory, be able to detect novel splice junctions. Intron retention, however, is always suspect because one cannot be certain whether it is true intron retention or simply capture of contaminating unspliced pre-mRNA or genomic DNA.