denovo assembly of unmapped reads
1
0
Entering edit mode
6.5 years ago
ashish ▴ 680

I have RNA-seq data of a wheat translocation line which has its chromosome short arm replaced by another wild variety chromosome short arm. There is no genomic or transcriptomic information available for that wild variety. I want to see what kind of genes are expressing from the new short arm. For this I was thinking if I can do denovo assembly of the unmapped reads and then blast the assembled transcripts with nr database. I want to know if this strategy is correct or not or if there is a better way to do it.

update: number of unmapped read pairs is around 8 million, total reads were around 33 million and I used the following parameters in hisat2

hisat2 -x wheat_hisat_index -1 13R_R1.fastq -2 13R_R2.fastq -S 13.sam -p 12 --mp 8,4 --rdg 7,5 --rfg 7,5 --un-conc 13_unmapped_pairs
RNA-seq • 2.9k views
ADD COMMENT
0
Entering edit mode

Provided you have an adequate amount of reads left over to do a reasonable assembly. What fraction of data has been left over? You are only going to capture genes that are unique for that short arm.

ADD REPLY
0
Entering edit mode

I also have the same doubt. Currently the job is running, Soon I will update the number of unmapped reads left and my aim is to capture only the unique genes.

ADD REPLY
0
Entering edit mode

I've updated the question.

ADD REPLY
0
Entering edit mode

Good point, I gave you my answer below but it holds only as long as @genomax's point is fulfilled.

ADD REPLY
0
Entering edit mode

To the updated question:

your mismatch penalty seems to be quite high compared to what the threshold for alignment scores is if you don't touch the scoring function.

Increasing penalties for gaps and mismatches obviously results in more unmapped reads.

ADD REPLY
0
Entering edit mode

I did not understand the --score-min parameter very well from the manual. Can you suggest me some better values?

ADD REPLY
1
Entering edit mode

There is no such thing as "better values". Parameters have to be chosen according to your data. From the manual:

--score-min <func> : Sets a function governing the minimum alignment score needed for an alignment to be onsidered "valid" (i.e. good enough to report). This is a function of read length. For instance, specifying L,0,-0.6 sets the minimum core function f to f(x) = 0 + -0.6 * x, where x is the read length. See also: [setting function options]. The default is L,0,-0.2.

What is your maximum read length after trimming? Let's say it's 100 nt for example. In this function, this would be multiplied by the third parameter, and after the multiplication the second parameter will be added. The final value is the threshold for the alignment score that is applied to keep or discard a mapping result of a read. That score is negative and every time you have a mismatch or a gap the algorithm will subtract the penalties that you specified for gaps and mismatches.

I think you should calculate the alignment score threshold in a way that you get what you want. For example, if you want to call variants between strains of the same species, you might allow very few mismtaches because you know that the genomes are very similar. If you're mapping reads from different species to a reference, you might allow more gaps/mismtaches because biologically speaking they're expected.

ADD REPLY
0
Entering edit mode

Thanks Macspider, I got it.

ADD REPLY
1
Entering edit mode
6.5 years ago

I think the best way to proceed would be to:

  • map your RNASeq reads against the wheat genome with stringent criteria with hisat2 or bowtie2 using:
  --mp <int>,<int>   max and min penalties for mismatch; lower qual = lower penalty <6,2>
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length
  
  • save the unmapped reads to a file using:
  --un <path>           write unpaired reads that didn't align to <path>
  --al <path>           write unpaired reads that aligned at least once to <path>
  --un-conc <path>      write pairs that didn't align concordantly to <path>
  --al-conc <path>      write pairs that aligned concordantly at least once to <path>
  
  • assemble those unmapped reads with trinity or any other RNASeq read assembler
  • run a quantification method
  • run a differential expression method :)
ADD COMMENT

Login before adding your answer.

Traffic: 2643 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6