Question

denovo assembly of unmapped reads

0

Entering edit mode

6.5 years ago

ashish ▴ 680

I have RNA-seq data of a wheat translocation line which has its chromosome short arm replaced by another wild variety chromosome short arm. There is no genomic or transcriptomic information available for that wild variety. I want to see what kind of genes are expressing from the new short arm. For this I was thinking if I can do denovo assembly of the unmapped reads and then blast the assembled transcripts with nr database. I want to know if this strategy is correct or not or if there is a better way to do it.

update: number of unmapped read pairs is around 8 million, total reads were around 33 million and I used the following parameters in hisat2

hisat2 -x wheat_hisat_index -1 13R_R1.fastq -2 13R_R2.fastq -S 13.sam -p 12 --mp 8,4 --rdg 7,5 --rfg 7,5 --un-conc 13_unmapped_pairs

RNA-seq • 2.9k views

ADD COMMENT • link 6.5 years ago by ashish ▴ 680

0

Entering edit mode

Provided you have an adequate amount of reads left over to do a reasonable assembly. What fraction of data has been left over? You are only going to capture genes that are unique for that short arm.

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

Entering edit mode

I also have the same doubt. Currently the job is running, Soon I will update the number of unmapped reads left and my aim is to capture only the unique genes.

ADD REPLY • link 6.5 years ago by ashish ▴ 680

0

Entering edit mode

I've updated the question.

ADD REPLY • link 6.5 years ago by ashish ▴ 680

0

Entering edit mode

Good point, I gave you my answer below but it holds only as long as @genomax's point is fulfilled.

ADD REPLY • link 6.5 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

To the updated question:

your mismatch penalty seems to be quite high compared to what the threshold for alignment scores is if you don't touch the scoring function.

Increasing penalties for gaps and mismatches obviously results in more unmapped reads.

ADD REPLY • link 6.5 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I did not understand the --score-min parameter very well from the manual. Can you suggest me some better values?

ADD REPLY • link 6.5 years ago by ashish ▴ 680

1

Entering edit mode

There is no such thing as "better values". Parameters have to be chosen according to your data. From the manual:

--score-min <func> : Sets a function governing the minimum alignment score needed for an alignment to be onsidered "valid" (i.e. good enough to report). This is a function of read length. For instance, specifying L,0,-0.6 sets the minimum core function f to f(x) = 0 + -0.6 * x, where x is the read length. See also: [setting function options]. The default is L,0,-0.2.

What is your maximum read length after trimming? Let's say it's 100 nt for example. In this function, this would be multiplied by the third parameter, and after the multiplication the second parameter will be added. The final value is the threshold for the alignment score that is applied to keep or discard a mapping result of a read. That score is negative and every time you have a mismatch or a gap the algorithm will subtract the penalties that you specified for gaps and mismatches.

I think you should calculate the alignment score threshold in a way that you get what you want. For example, if you want to call variants between strains of the same species, you might allow very few mismtaches because you know that the genomes are very similar. If you're mapping reads from different species to a reference, you might allow more gaps/mismtaches because biologically speaking they're expected.

ADD REPLY • link 6.5 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thanks Macspider, I got it.

ADD REPLY • link 6.5 years ago by ashish ▴ 680

score 1 · Answer 1 · 2017-10-17

I think the best way to proceed would be to:

map your RNASeq reads against the wheat genome with stringent criteria with hisat2 or bowtie2 using:

  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length

save the unmapped reads to a file using:

  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>

assemble those unmapped reads with trinity or any other RNASeq read assembler
run a quantification method
run a differential expression method :)