HeLa: Transcript Expression Database?
12.5 years ago

I wish to determine the most abundant splice variant in HeLa cells for each gene.

A quick search of GEO shows that there are RNA-seq experiments I could mine for this information, but before I reinvent the wheel, I am wondering if anyone is aware of a repository with pre-calculated values on a per-transcript basis (e.g. BioGPS does not appear to have this information).

splicing database rna • 6.4k views

Useful tips here on how to do the analysis, but it seems that this has not been pre-calculated. I might put this forward for an MSc project.

12.5 years ago

Assuming you mean the most abundant known splice variant of each gene, you could do the following. Download an existing HeLa RNA-seq data set from GEO, such as GSM759888. Since a BAM is provided, you can skip the alignment step and go straight to estimating isoform expression levels.

If you want to do your own alignments (say, to take advantage of a new aligner version), convert the BAM back to FASTQ files using Picard SamToFastq, then run TopHat to get new alignments.

Next, get a GTF file representing all transcripts of all genes for human build hg19/GRCh37. For example, you can get one for all Ensembl transcripts here: Homo_sapiens.GRCh37.64.gtf.gz

Use this GTF to run Cufflinks with the -G option. Once it finishes, you should have an isoforms.fpkm_tracking file. This contains the FPKM expression estimate for each transcript, each of which should also be labeled with a gene ID. With that data in hand, it is straightforward to identify the most highly expressed transcript for each gene that has multiple known transcripts.
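As a rough illustration of that last step, here is a minimal Python sketch. It assumes the Cufflinks per-transcript output is a tab-delimited file (isoforms.fpkm_tracking) with a header row containing at least the columns tracking_id, gene_id and FPKM. The function name top_isoform_per_gene and the toy rows are made up for the example and do not come from any real data set.

```python
import csv
import io
from collections import defaultdict


def top_isoform_per_gene(handle, multi_isoform_only=True):
    """Return {gene_id: (transcript_id, fpkm)} for the highest-FPKM transcript.

    `handle` is an open text file in the assumed isoforms.fpkm_tracking
    layout: tab-delimited, with tracking_id, gene_id and FPKM columns.
    """
    by_gene = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_gene[row["gene_id"]].append((row["tracking_id"], float(row["FPKM"])))

    best = {}
    for gene, isoforms in by_gene.items():
        # Optionally skip genes with a single annotated transcript,
        # since there is nothing to choose between.
        if multi_isoform_only and len(isoforms) < 2:
            continue
        best[gene] = max(isoforms, key=lambda pair: pair[1])
    return best


# Toy input standing in for a real tracking file (hypothetical IDs and values).
demo = io.StringIO(
    "tracking_id\tgene_id\tFPKM\n"
    "T1\tG1\t5.0\n"
    "T2\tG1\t12.5\n"
    "T3\tG2\t3.0\n"
)
result = top_isoform_per_gene(demo)
print(result)
```

On a real run you would pass `open("isoforms.fpkm_tracking")` instead of the toy input. You may also want to drop rows whose status column is not OK, or genes where the top two isoforms have overlapping confidence intervals, before trusting the ranking.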

If you do not want to be limited to the known or Ensembl-predicted alternative transcripts of each gene, the problem becomes more complicated. You could attempt to merge transcript GTFs from multiple sources to get a more comprehensive representation of existing transcript annotations. Or you could run Cufflinks in de novo mode, then figure out which assembled transcripts correspond to the same loci (using Cuffcompare, perhaps), and select the one with the highest abundance for each locus.

Hi, I did what you wrote, running cufflinks -G Homo_sapiens.GRCh37.64.gtf GSM759888_hela.bam, but I get a segfault. The full output is:

Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).

[11:47:48] Loading reference annotation.

[11:48:14] Inspecting reads and determining fragment length distribution.

Segmentation fault: 11

12.5 years ago

Alastair,

It may also be useful to look at EST data made from cDNA libraries taken from HeLa cells. Overall, this is not as informative as RNA-seq, but it may still be able to distinguish relative abundance. You would look at the number of EST entries corresponding to different splice variants (i.e. spanning a diagnostic splice junction).

ESTs will, in theory, be able to detect novel splice junctions. Intron retention, however, is always suspect because one is not certain if this is true intron retention or simply capture of contaminating unspliced mRNA or genomic DNA.
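To make the EST counting idea concrete, here is a small Python sketch. It assumes you have already aligned the HeLa ESTs and reduced the result to (EST ID, junction label) pairs, one pair per EST that spans a diagnostic junction. The junction labels, EST IDs and the junction_support function are all hypothetical and only illustrate the tallying step.

```python
from collections import Counter


def junction_support(est_hits):
    """Count distinct ESTs per diagnostic junction.

    `est_hits` is an iterable of (est_id, junction_label) pairs.
    Returns {junction_label: (count, fraction_of_all_supporting_ESTs)}.
    """
    # De-duplicate so an EST reported twice for the same junction counts once.
    unique_hits = set(est_hits)
    counts = Counter(junction for _, junction in unique_hits)
    total = sum(counts.values())
    return {junction: (n, n / total) for junction, n in counts.items()}


# Toy data (hypothetical): two ESTs support exon2-exon3, one supports exon2-exon4.
hits = [
    ("EST1", "exon2-exon3"),
    ("EST2", "exon2-exon3"),
    ("EST3", "exon2-exon4"),
    ("EST1", "exon2-exon3"),  # duplicate record, ignored
]
support = junction_support(hits)
for junction, (n, frac) in sorted(support.items()):
    print(f"{junction}\t{n}\t{frac:.2f}")
```

With only a handful of ESTs per gene, these fractions are coarse, so they are better treated as supporting evidence than as an expression estimate.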
