Question

Salmon transcriptome generation

0

5.1 years ago

Morris_Chair ▴ 350

Dear all,

I want to count the reads from RNA-seq using the salmon tool. I was told that the first step is to create a FASTA file with the transcript sequences (built from the reference genome FASTA plus the GTF annotation file), such as:

(salmon) [@ws7910 RNAseq]$ gffread -w transcripts.fa -g Homo_sapiens.GRCh38.cdna.all.fa Homo_sapiens.GRCh37.75.gtf

I get this result

No fasta index found for Homo_sapiens.GRCh38.cdna.all.fa. 
Rebuilding, please wait..
Fasta index rebuilt.
Error creating file: annotation/transcript/transcripts.fa

It creates a file named Homo_sapiens.GRCh38.cdna.all.fa.fai

On the salmon website it seems that this step is not necessary, so can you please tell me what I am doing wrong, and what is the best way to start using salmon?

I'd prefer to do the counting with BAM files already aligned using TopHat2.

Thank you

RNA-Seq • 7.4k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 5.1 years ago by Morris_Chair ▴ 350

score 5 · Answer 1 · 2019-03-10

5

5.1 years ago

Rob 6.5k

Hi Morris,

The easiest way to start using salmon is to download the transcriptome file directly (e.g. from Gencode --- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz). Then, you can create a salmon index via:

salmon index -t gencode.v29.transcripts.fa.gz -i gencode_v29_idx

and then you can quantify using gencode_v29_idx as the salmon index.
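As a minimal sketch of that quantification step, assuming paired-end FASTQ input: the read file names and the output directory below are placeholders, not files from this thread. The function only prints the command so it can be inspected before running. `--validateMappings` is, as far as I know, already the default in salmon 1.0 and later, so it may be redundant there.

```shell
#!/usr/bin/env bash
# Sketch: build the salmon quant command for one paired-end sample.
# All file names passed in below are hypothetical placeholders.

quant_cmd() {
    local index=$1 r1=$2 r2=$3 out=$4
    # -i: index directory created by "salmon index"
    # -l A: let salmon infer the library type
    # -1/-2: mate 1 and mate 2 FASTQ files; -p: threads; -o: output directory
    echo salmon quant -i "$index" -l A -1 "$r1" -2 "$r2" --validateMappings -p 4 -o "$out"
}

# Print the command; remove the echo inside quant_cmd to actually run it.
quant_cmd gencode_v29_idx sample_1.fastq.gz sample_2.fastq.gz quants/sample
```

The transcript-level estimates would then be expected in `quants/sample/quant.sf`.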

ADD COMMENT • link 5.1 years ago by Rob 6.5k

0

Hi Rob, thank you for your answer. Why did you omit the k-mer size (-k) for the index generation?

thanks

ADD REPLY • link 5.1 years ago by Morris_Chair ▴ 350

0

Hi Morris,

The -k argument has a default value (31) that is used if -k is not provided. If you wish to use the default, you don't have to pass that option explicitly. If you want to use another value of k, then you can pass that to the index command.

ADD REPLY • link 5.1 years ago by Rob 6.5k

1

As a rule of thumb in my experience, the only really crucial thing is that the k-mer length is shorter than or equal to the read length. k-mer length only has a minor influence on the mapping rate; see Salmon Quantification for RNA-seq Read Pairs with Different Lengths to get a rough idea. Given a high-quality library without contamination, the most important factor is the read length, which eventually comes down to proper experimental design. When indexing GENCODE files, also remember to pass the --gencode option to salmon index.
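A rough sketch of how one might check a chosen k against the reads before indexing. The helper names (`min_read_length`, `check_k`, `index_cmd`) are made up for illustration, and sampling only the first 10000 reads is an arbitrary assumption. The index command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: compare a chosen k with the shortest read seen in a FASTQ file.

# Print the shortest read length among the first n_reads records.
min_read_length() {
    local fastq=$1 n_reads=${2:-10000}
    # Decompress only if the file name ends in .gz.
    case "$fastq" in
        *.gz) gzip -cd "$fastq" ;;
        *)    cat "$fastq" ;;
    esac | head -n $((n_reads * 4)) |
        # Sequence lines are every 4th line, starting at line 2.
        awk 'NR % 4 == 2 { len = length($0); if (min == "" || len < min) min = len }
             END { print min + 0 }'
}

# Succeed only if k is no longer than the shortest sampled read.
check_k() {
    local fastq=$1 k=$2 shortest
    shortest=$(min_read_length "$fastq")
    if [ "$shortest" -lt "$k" ]; then
        echo "k=$k exceeds the shortest read ($shortest nt); choose a smaller k"
        return 1
    fi
    echo "k=$k fits (shortest read: $shortest nt)"
}

# Print the index command with an explicit k and the --gencode option.
index_cmd() {
    local k=${1:-31}
    echo salmon index -t gencode.v29.transcripts.fa.gz -i gencode_v29_idx -k "$k" --gencode
}

index_cmd 31
```

With a real file, usage could look like `check_k reads_1.fastq.gz 31 && index_cmd 31`, where `reads_1.fastq.gz` is a placeholder name.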

ADD REPLY • link 5.1 years ago by ATpoint 81k

0

I created the salmon index as you described and it's OK. I should tell you in advance that I'm new to this, and I'm lost on how to write the command to quantify the reads using gencode_v29_idx.

I have 4 BAM files to quantify and I don't know how to formulate the command. First, can I run multiple BAM files at once (two treated and two control, for instance), or do I have to do them one by one? I thought the structure of the command should be like this, but I don't know how to insert gencode_v29_idx:

Salmon quant  --libType A -p 4 --validate mapping --seqBias --gcBias -a 1file.bam 2fie.bam 3.file.bam 4fie.bam -o salmon_quant

thank you Rob

ADD REPLY • link 5.1 years ago by Morris_Chair ▴ 350

0

It's probably more convenient to use FASTQ files, because then there are no other biases.
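A sketch of what quantifying from FASTQ could look like for four samples, one salmon run per sample, since files passed together to a single `salmon quant` call are treated as one library. The sample names and the `fastq/<sample>_1.fastq.gz` naming scheme are hypothetical. As far as I understand, the `-a` (alignment-based) mode expects BAM files aligned to the transcriptome rather than to the genome, which is another reason to start from FASTQ here. The commands are printed, not run.

```shell
#!/usr/bin/env bash
# Sketch: print one salmon quant command per sample.
# Sample names and the FASTQ naming scheme are hypothetical.

quant_all() {
    local s
    for s in "$@"; do
        # Each sample gets its own output directory under quants/.
        echo salmon quant -i gencode_v29_idx -l A \
            -1 "fastq/${s}_1.fastq.gz" -2 "fastq/${s}_2.fastq.gz" \
            --seqBias --gcBias -p 4 -o "quants/${s}"
    done
}

quant_all control1 control2 treated1 treated2
```

Piping the output to `bash` (or removing the `echo`) would execute the runs one after another.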

ADD REPLY • link 5.1 years ago by Morris_Chair ▴ 350