Question

high q value of RNA-seq(By Hisat2-Stringtie-Ballgown)

0

Entering edit mode

5.1 years ago

245712608 • 0

Hello! I am a novice in RNA-seq analysis. Recently, I followed the protocol "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown" to perform RNA-seq analysis, but got gene lists with their q values higher than 0.1. I was confused, cuz my mentor already did that once and got gene lists with acceptable q values. Then I went to check the sam file produced by Hisat2, and found that the mapQ values are too uniform (0,1,60 only). So I was wondering if I got something wrong with my Hisat2 analysis part.

The following is my bash code for this analysis. Hope you could provide me some valuable advices! Very appreciate that!

export PATH=$HOME/bin:$PATH
hisat2 -p 4 --dta -x mm10/genome -U WTBMCold0h1_1.fq.gz -S WTBMCold0h1_1.sam
samtools view -h -q 15 WTBMCold0h1_1.sam -o 1WTBMCold0h1_1.sam

After deleting the WTBMCold0h1_1.sam file I change the name of 1WTBMCold0h1_1.sam into WTBMCold0h1_1.sam*
samtools sort -@ 4 -o WTBMCold0h1_1.bam WTBMCold0h1_1.sam

sh stringtie1.sh

stringtie1.sh:

for i in {0h1,0h4,4h4,5d1,5d4,12h1,12h4};do
    stringtie -p 4 -G genes.gtf -o WTBMCold${i}_1.gtf -l WTBMCold${i}_1 WTBMCold${i}_1.bam
    rm WTBMCold${i}_1.bam
done

stringtie --merge -p 8 -G genes.gtf -o stringtie_merged.gtf mergelist.txt

sh stringtie2.sh

stringtie2.sh:

for i in {0h1,0h4,4h1,4h4,5d1,5d4,12h1,12h4};do
    stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/WTBMCold${i}_1/WTBMCold${i}_1.gtf WTBMCold${i}_1.bam
done

R analysis:

library(ballgown)

library(RSkittleBrewer)

library(genefilter)

library(dplyr)

library(devtools)

pheno_data_file <- "pheno_list.txt"

pheno_data <- read.table(pheno_data_file, header=TRUE, colClasses = rep("character", 2))

pheno_data = pheno_data[order(pheno_data$ids),]

bg <- ballgown(dataDir = "ballgown", samplePattern="WTBMCold", pData=pheno_data)

bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)

results_transcripts = stattest(bg_filt, feature="transcript",covariate="timepoint", getFC=TRUE, meas="FPKM")

results_genes = stattest(bg_filt, feature="gene", covariate="timepoint",  getFC=TRUE, meas="FPKM")

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_filt), geneIDs=ballgown::geneIDs(bg_filt), results_transcripts)

results_transcripts = arrange(results_transcripts,qval)

results_genes = arrange(results_genes,qval)

write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE)

write.csv(results_genes, "gene_results.csv", row.names=FALSE)

subset(results_transcripts,results_transcripts$qval<0.05)

subset(results_genes,results_genes$qval<0.05)

RNA-Seq • 1.9k views

ADD COMMENT • link updated 5.1 years ago by i.sudbery 19k • written 5.1 years ago by 245712608 • 0

0

Entering edit mode

How many replicates do you have? I think ballgown is recommended for more than 5 replicates, why not try deseq2?

ADD REPLY • link 5.1 years ago by Buffo ★ 2.4k

0

Entering edit mode

Ok I will give it a try. Thanks buffo.

ADD REPLY • link 5.1 years ago by 245712608 • 0

0

Entering edit mode

I have edited your post to make it more readable. Numbered lists are created by using a number, then a period, then a space. You were missing the space. Short sections of code can be formatted with back-ticks, and longer sections by starting lines with four spaces (or 8 in a numbered list) or using the code format button (the one labelled 101010).

ADD REPLY • link 5.1 years ago by i.sudbery 19k

0

Entering edit mode

I wouldn't worry about the mapQ numbers, I each aligner sets these differently and I think that HiSat only ever uses those three values.

ADD REPLY • link 5.1 years ago by i.sudbery 19k

0

Entering edit mode

Thanks a lot. I have removed the —dta parameters and try it once.Hope I can get desirable results.

ADD REPLY • link 5.1 years ago by 245712608 • 0