How to separate lncRNA and mRNA from RNA-Seq data for differential expression?
2
0
Entering edit mode
5.2 years ago

Hi

I am having RNA-Seq data from TCGA. I want to separate lncRNA and mRNA from RNA-Seq data for differential expression. Please, guide me how to separate lncRNA and mRNA from RNA-Seq data?

Thank you

RNA-Seq • 2.6k views
ADD COMMENT
2
Entering edit mode

Please explain how you performed the analysis? Alignment -> quantification (which annotation) -> Differential expression (which tool) ? And what kind of file you have now (paste sample output)?

When you provide more information, you get more specific response.

ADD REPLY
0
Entering edit mode

I downloaded raw RNA-Seq data of cancer in fastq format. Then I performed alignment using Tophat. Now, I am planning to do quantification and differential expression using Salmon and Deseq2. But I have to identify lncRNA and mRNA before differential expression.

ADD REPLY
2
Entering edit mode

few remarks:

TopHat is no longer advised to do read alignment (using Salmon you even don't need to do actual alignments anymore).

I'm puzzled why you are eager to filter gene types before doing DEG analysis? why not use all of them and filter after doing DEG?

ADD REPLY
2
Entering edit mode

Can you add a bit more details on this? For example, what kind of data do you have? which organism are you working with?

From 'raw' RNAseq data it will be impossible to filter out reads that are derived from either one of them. If you align them to a genome or transcriptome you might be able to distinguish them.

ADD REPLY
3
Entering edit mode
3.6 years ago

Just adding an even easier way to identify biotype automatically - this will create an annotation data-frame that can be used for any purpose downstream:

require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
annot <- getBM(
  mart = mart,
  attributes = c(
    'ensembl_gene_id',
    'hgnc_symbol',
    'entrezgene_id',
    'gene_biotype'),
  uniqueRows = TRUE)

  ensembl_gene_id hgnc_symbol entrezgene_id   gene_biotype
1 ENSG00000210049       MT-TF            NA        Mt_tRNA
2 ENSG00000211459     MT-RNR1            NA        Mt_rRNA
3 ENSG00000210077       MT-TV            NA        Mt_tRNA
4 ENSG00000210082     MT-RNR2            NA        Mt_rRNA
5 ENSG00000209082      MT-TL1            NA        Mt_tRNA
6 ENSG00000198888      MT-ND1          4535 protein_coding

unique(annot$gene_biotype)
 [1] "Mt_tRNA"                            "Mt_rRNA"                           
 [3] "protein_coding"                     "transcribed_unprocessed_pseudogene"
 [5] "unprocessed_pseudogene"             "miRNA"                             
 [7] "lncRNA"                             "processed_pseudogene"              
 [9] "snRNA"                              "transcribed_processed_pseudogene"  
[11] "misc_RNA"                           "TEC"                               
[13] "transcribed_unitary_pseudogene"     "snoRNA"                            
[15] "scaRNA"                             "rRNA_pseudogene"                   
[17] "unitary_pseudogene"                 "polymorphic_pseudogene"            
[19] "pseudogene"                         "rRNA"                              
[21] "IG_V_pseudogene"                    "ribozyme"                          
[23] "sRNA"                               "TR_V_gene"                         
[25] "TR_V_pseudogene"                    "TR_D_gene"                         
[27] "TR_J_gene"                          "TR_C_gene"                         
[29] "TR_J_pseudogene"                    "IG_C_gene"                         
[31] "IG_C_pseudogene"                    "IG_J_gene"                         
[33] "IG_J_pseudogene"                    "IG_D_gene"                         
[35] "IG_V_gene"                          "IG_pseudogene"                     
[37] "translated_processed_pseudogene"    "scRNA"                             
[39] "vault_RNA"                          "translated_unprocessed_pseudogene"

Kevin

ADD COMMENT
1
Entering edit mode
5.2 years ago
ATpoint 82k

Check the GTF (genome annotation file) that matches your analysis and filter out those genes that have a CDS (coding sequence) and/or are annotated as protein-coding. Having the gene names, you can then scan your count matrix before or after differential expression analysis for the genes of interest.

Edit: See Kevin's comment below for details.

ADD COMMENT
0
Entering edit mode

Is there any tool to filter out lncRNA and mRNA from GTF? How can I scan count matrix using the gene names?

ADD REPLY
2
Entering edit mode

You can grep lines annotated as lncRNA/mRNA from your GTF file.

ADD REPLY
2
Entering edit mode

Indeed, was just doing it:

Obtain the relevant GTF from GENCODE (you will want the Comprehensive gene annotation) and then identify lincRNA and protein_coding mRNA via the gene_type / transcript_type tag in the GTF.

grep -e "lincRNA" /Kev/CollegeWork/ReferenceMaterial/GENCODE/GRCh38.p12/gencode.v28.annotation.gtf | head -5
chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    29554   30039   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30564   30667   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30976   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

Take a look at all biotypes, here: https://www.gencodegenes.org/pages/biotypes.html

ADD REPLY
0
Entering edit mode

This is useful. After I grep different transcript biotypes from the GTF using the applicable search terms, how can I get these transcript biotypes from my own FASTA file? So that I get RNA reads separated based on transcript biotype. I am starting with my FASTA file with the RNA-Seq reads and then the GTF. Thank you.

ADD REPLY
0
Entering edit mode

If such a FASTA is not already available on GENCODE at the site(s) to which we link you, then you could filter the 'comprehensive' FASTA using AWK or Python. If you do not feel too comfortable going that route, you could just derive counts over all transcripts and then filter this data when you aim to use it downstream, e.g., in DESeq2

ADD REPLY
0
Entering edit mode

Thank you for your reply and for the code for an automatic route. Actually, I am not the OP. But your post C: How to separate lncRNA and mRNA from RNA-Seq data for differential expression? looked a lot like what I had been wanting to do except that it is what I want to do for my RNA-Seq reads, if possible. Although what I want to do is related to what the OP wanted to do, mine is how do I split up my FASTA file which contains the RNA-Seq reads into the different transcript biotypes? This is only dealing with sequences. Thank you.

ADD REPLY
0
Entering edit mode

Hi, as I suggested, you could do this with awk or Python. If you have no coding skills, then it is obviously a problem.

As mentioned, also, if your aim is to perform read count abundance over these transcripts, then you could perform this [the read count abundance step] over all transcripts, and then filter for, e.g., lncRNA, after this step. It would be easier than filtering the FASTA file, I think.

ADD REPLY
0
Entering edit mode

My apologies that I was not clear. At this time, I shall not be performing read count abundance. My aim is how to go from having a bunch of RNA-Seq reads in my FASTA file to knowing what transcript biotypes are represented in the reads. Thank you.

ADD REPLY
1
Entering edit mode

I see, but, to do that, you still need to align these reads.

In the GENCODE FASTA header, the biotype is already encoded [in the header]. So, it is still feasible for you to align your reads to all transcripts and then check to see for which ones most reads have aligned.

Look, here are the FASTA headers in the current GENCODE release (see biotype at end of header):

zcat gencode.v35.transcripts.fa.gz | grep -e "^>" | head -15
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
>ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|
>ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|
>ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|
>ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|
>ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-201|MIR1302-2HG|535|lncRNA|
>ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA|
>ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-201|FAM138A|1187|lncRNA|
>ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-202|FAM138A|590|lncRNA|
>ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|OR4G4P-201|OR4G4P|840|unprocessed_pseudogene|
>ENST00000642116.1|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000492680.1|OR4G11P-202|OR4G11P|1414|processed_transcript|
>ENST00000492842.2|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000003224.3|OR4G11P-201|OR4G11P|939|transcribed_unprocessed_pseudogene|
>ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-202|OR4F5|2618|protein_coding|
>ENST00000335137.4|ENSG00000186092.6|OTTHUMG00000001094.4|-|OR4F5-201|OR4F5|1054|protein_coding|
>ENST00000466430.5|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003225.1|AL627309.1-201|AL627309.1|2748|lncRNA|
ADD REPLY
0
Entering edit mode

Please, could this alignment be done using STAR? STAR requires reference genome sequences (FASTA files) and annotations (GTF file) that it uses to generate genome index files. Then in the second mapping step, STAR uses the genome index files along with the user's RNA sequence reads (FASTA files) to map the user's RNA sequence reads to the genome.

When giving examples of acceptable genome sequence files for GENCODE, STAR lists "files marked with PRI (primary)", which makes me think it wants me to use ....primary_assembly.genome.fa.gz but from this discussion, I want to use gencode.v35.transcripts.fa.gz

The STAR manual contains the information above in sections 1.2 and 2.21 on pages 4, 5, 6. https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Thank you.

ADD REPLY
2
Entering edit mode

I recommend investing time learning Unix tool usage, they're powerful and versatile, you'll need them everyday in your bioinformatics career.

ADD REPLY

Login before adding your answer.

Traffic: 2734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6