Filtering low expression transcripts after a normalized assembly : which fastq file should I used?
0
0
Entering edit mode
7.7 years ago
qingxiangg ▴ 40

Hi all,

I have quite large datasets for transcriptome assembly, so I use --normalize_reads when assembling;

 nohup Trinity --seqType fq --left F_clean.fastq --right R_clean.fastq --max_memory 40G --min_contig_length 250 --group_pairs_distance 250 --jaccard_clip --normalize_reads &

Then I tried to use align_and_estimate_abundance.pl for calculating the abundance

nohup align_and_estimate_abundance.pl --transcripts Trinity.fasta --seqType fa --left left.fa --right right.fa --est_method RSEM --aln_method bowtie2 --trinity_mode --output_dir rsem_outdir --prep_reference  --thread_count 4 > /dev/null 2>&1 &

So my question is:

When mapping the raw reads to Trinity.fasta, which raw reads should I used? The normalized reads in insilico_read_normalization? Or the reads before normalization?

(In fact I tried them both, and I got zero of FPKM and TPM when I used the reads before normalization)

Thanks!

RNA-Seq • 2.3k views
ADD COMMENT
1
Entering edit mode

I guess the normalisation is to reduce the data size to be used for assembly. If you would like to quantify, you should use the raw data to align and quantify.

What zero FPKM ? for all the transcripts ? There must be something wrong.

ADD REPLY
0
Entering edit mode

Hi Atla and Joe thanks for your suggestion

Yes, it's a ALL-FPKM-ZERO problem, I've googled and done troubleshooting these days. and I think I've got something wrong with the bam file.

I choose the non-normalised data and try to run RSEM-1.2.29 by myself instead of through Trinity.

I didn't realize it's the problem of my bam file until I got the following error message when using rsem-sam-validator and convert-sam-for-rsem

Only find one mate for paired-end read ST-E00126:217:HVVCNCCXX:2:1101:20902:1907! The input file is not valid!

Number of first and second mates in read ST-E00126:217:HVVCNCCXX:2:1101:991:10943's full alignments (both mates are aligned) are not matched!

It seems that RSEM can't find the mate for ST-E00126:217:HVVCNCCXX:2:1101:20902:1907 in bam file, but when I look into the file:

ST-E00126:217:HVVCNCCXX:2:1101:20902:1907 163 TRINITY_DN85708_c3_g1_i1 206 255 94M = 274 218
ST-E00126:217:HVVCNCCXX:2:1101:21044:1907 83 TRINITY_DN85708_c3_g1_i1 274 255 150M = 206 -218

The ST-E00126:217:HVVCNCCXX:2:1101:20902:1907 does have a mate, although in the name of ST-E00126:217:HVVCNCCXX:2:1101:21044:1907

where is the mate pair of 20902:1907? Unmapped?? I can't find it in the bam results.

Do you have any idea about this? Maybe something wrong with my raw data??

Attached is a small fragment of sam file containing reads that triggering this error.

Thanks

https://sourceforge.net/projects/bam-file/files/expp.bam/download

ADD REPLY
0
Entering edit mode

Have you mapped your raw reads back to the assembled transcriptome? What percentage of input reads will map back?

ADD REPLY
0
Entering edit mode

Hi Joe

my non-normalized raw reads: 56597073 normalized raw reads: 7709656 paired reads I finally choose the raw data to align and quantify according to Atla's suggestion (using perl script in trinity which mapping reads by bowtie, quantifying by RSEM, but still got the ALL FPKM ZERO problem).

I use the following cmd to check the output bam file, samtools flagstat bowtie.bam

here is what I got, 3655162 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 3655162 + 0 mapped (100.00%:-nan%) 3655162 + 0 paired in sequencing 1827581 + 0 read1 1827581 + 0 read2 3655162 + 0 properly paired (100.00%:-nan%) 3655162 + 0 with itself and mate mapped 0 + 0 singletons (0.00%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

Seems like a 100% mapping, but 3655162 is far more less than 7709656*2.

ADD REPLY

Login before adding your answer.

Traffic: 2701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6