Question

Completely confused about terminology and what I'm actually analysing here

0

Entering edit mode

6.4 years ago

l.burke.1 ▴ 10

I'm new to bioinformatics and I've been tasked with performing analysis of RNA-seq data. Essentially I have lots of Illumina paired reads from different samples that I want to map to a reference and carry out differential expression analysis with.

Since the species in question is a non-model species, there is no sequenced, annotated genome available. I also cannot assemble a transcriptome de novo.

Instead, I have a set of several thousand 'contigs' assembled from thousands of 'expressed sequence tags' (ESTs). These contigs have been annotated in so far that each one has been given a (probable) biological function. For example:

Contig ID : Contig0001

Contig Description : rrm-containing protein

Contig Length : 1235

1:ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT

61:AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC

   [...]

1201:TTGTAGACATCCATTTTGTATACTCGGAATTTCTA

and

Contig ID : Contig0002 

Contig Description : chaperone protein dnaj 16-like isoform x1 

Contig Length : 948 

1:TTTCTATTTTGCCTTCGATTAATTTTCATCTTTCAATAAGTTTTACTTTAATTTCTTCCG 

61:ACTTCATTTTATTCACGCAACATTTCACCATTGAGTTTGCCAACTGAAGAGAACCGTAGC

 [...]

901: CAGACAGCTTCGATGACAGTCAATAAGGATCCAGATGCGACATTTTTC

1) What exactly could these contigs represent? I initially thought they were complete transcripts, but I'm now thinking they only form part of transcripts?

2) How would I go about mapping my reads to these contigs? Is that even possible? It looks like the file isn't in fasta format - would a more suitable format look something like this?

>Contig0001|rrm-containing_protein|1235

ATACAGCTTGGAAAATTAAATACCTTTGCACTCTGTCTGATCACCTTCACAGCCTTGCTT

AGATTCCTTTTCTCTTCTCTATTTCTTCTTCTTCTTTTTTTGACATGGAAGAGGAAGAGC

  [...]

TTGTAGACATCCATTTTGTATACTCGGAATTTCTA

3) If I did manage to map my reads and perform quantification - what exactly am I quantifying? Would it be transcript-level quantification or gene-level quantification?

Any help would be greatly appreciated! Let me know if you need more info.

RNA-Seq rna-seq next-gen sequence • 2.2k views

ADD COMMENT • link updated 6.4 years ago by i.sudbery 19k • written 6.4 years ago by l.burke.1 ▴ 10

score 1 · Answer 1 · 2017-11-22

1) some of them probably are, some are probably not, if these are published, all info should be in the paper;

2) I am not an RNA person, but any aligner, such as bwa, should work;

3) I guess it depends on your data and goals of the project. You probably should get some theoretical knowledge in RNA-seq. Take a look at this question:

How Do I Get Started Working With Rna-Seq Data

And take a look at these resources:

https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/rna-sequencing/rna-seq-analysis-transcriptome

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4972086/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/

score 0 · Answer 2 · 2017-11-23

0

Entering edit mode

6.4 years ago

pbpanigrahi ▴ 420

Answer is just a wild guess. Since (1) ESTs are short sequences from 5' and 3' of gene (2) and you have assembled them into contigs (3) and annotation for these contigs are available, you can safely consider the contigs as representative of the genes. You can align reads against these ESTs to see which genes/transcripts are expressed. Differential expression study can be tricky since you don't have transcript length information.

ADD COMMENT • link 6.4 years ago by pbpanigrahi ▴ 420

0

Entering edit mode

Differential expression study can be tricky since you don't have transcript length information

I disagree. You don't need transcript length information to perform differential expression analysis with DESeq or edgeR and similar raw-counts based methods. In addition, if you really want to compute FPKM, contigs length can serve as a proxy for transcript length: If the true transcript is longer than the annotated contig, then the reads corresponding to the missing part will not map, thus the transcript true length is irrelevant here.

ADD REPLY • link 6.4 years ago by Carlo Yague 8.6k

0

Entering edit mode

This may be a silly question - but how would I compute the counts using a tool like HTSeq. I understand how I would go about doing it if a well annotated genome was available, but that isn't my case. HTSeq requires an annotation file (GFF) but I don't have that - is it feasible to build my own annotation file with the data/info that I've been given? Or should I explore using an alignment-free method such as salmon?

Edit: in this post, Devon Ryan suggests filtering out multi mappers and then just counting reads per contig using something like samtools idxstats. This could be inaccurate with a de novo assembly since you'd likely be throwing away a lot of reads, however with my reference I'm getting ~0.03% multi mappers - does it seem reasonable to just chuck these away?

Found the corresponding paper here.

ADD REPLY • link 6.4 years ago by l.burke.1 ▴ 10

0

Entering edit mode

is it feasible to build my own annotation file with the data/info that I've been given?

Its possible, you could do something like : seqname='contig name', start =1; end ='Contig Length'; strand='+'.But Devon's suggestion is much easier. Just be careful to differentiate between reads originating from the forward and reverse strand (if you have a strand-specific dataset).

Or should I explore using an alignment-free method such as salmon?

That's the alternative. Both are ok IMO.

with my reference I'm getting ~0.03% multi mappers - does it seem reasonable to just chuck these away?

Sure.

ADD REPLY • link 6.4 years ago by Carlo Yague 8.6k

score 0 · Answer 3 · 2017-11-23

If is more or less impossible to know whether these contigs better represent genes or transcripts. I would be inclined to call the 'transfrags' (A TRANScripted FRAGment). This would of course be made easier if the organism in question does not have splicing. Then transcripts and genes are the same thing.
Whichever approach you take, you'll need to convert those contigs to fasta format. You'll want to create a single fasta file that contains the sequence of all the contigs in seperate records. At that point you could go one of two ways:
- align with a traditional aligner, since you don't know about the splicing status of these contigs, I recommend a splicing aware aligner like HISAT or STAR. Your first job will be to treat your contigs as the genome and build an index using the apporiate command for the aligner you choose to use. You could then quantify by just looking at the number of reads that map to each contig using samtools samtools idxstats mybamfile.bam | cut -f1,3
- A second approach is alignment free quantitation with tools such as salmon or kallisto. This might work particularly well here as they are designed to take in fasta files of transcript sequences and output quantitations: They never consider genome sequence. They will also seamlessly deal with the multimapping problem, which is a real problem if you are not sure if you are dealing with genes or transcripts. Again turn the contigs into fasta, build an index over the fasta, and then quantify.
See above. Difficult to tell. Treat them as transcripts, but transcripts is what we should be dealing with anyway because those are the real physical objects. Genes are just an imaginary abstraction created to make biology easier to talk about.