Hello! I am a novice in RNA-seq analysis. Recently, I followed the protocol "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown" to perform RNA-seq analysis, but got gene lists with their q values higher than 0.1. I was confused, cuz my mentor already did that once and got gene lists with acceptable q values. Then I went to check the sam file produced by Hisat2, and found that the mapQ values are too uniform (0,1,60 only). So I was wondering if I got something wrong with my Hisat2 analysis part.
The following is my bash code for this analysis. Hope you could provide me some valuable advices! Very appreciate that!
export PATH=$HOME/bin:$PATH
hisat2 -p 4 --dta -x mm10/genome -U WTBMCold0h1_1.fq.gz -S WTBMCold0h1_1.sam
samtools view -h -q 15 WTBMCold0h1_1.sam -o 1WTBMCold0h1_1.sam
After deleting the WTBMCold0h1_1.sam file I change the name of 1WTBMCold0h1_1.sam into WTBMCold0h1_1.sam*
samtools sort -@ 4 -o WTBMCold0h1_1.bam WTBMCold0h1_1.sam
sh stringtie1.sh
stringtie1.sh:
for i in {0h1,0h4,4h4,5d1,5d4,12h1,12h4};do stringtie -p 4 -G genes.gtf -o WTBMCold${i}_1.gtf -l WTBMCold${i}_1 WTBMCold${i}_1.bam rm WTBMCold${i}_1.bam done
stringtie --merge -p 8 -G genes.gtf -o stringtie_merged.gtf mergelist.txt
sh stringtie2.sh
stringtie2.sh:
for i in {0h1,0h4,4h1,4h4,5d1,5d4,12h1,12h4};do stringtie -e -B -p 4 -G stringtie_merged.gtf -o ballgown/WTBMCold${i}_1/WTBMCold${i}_1.gtf WTBMCold${i}_1.bam done
R analysis:
library(ballgown) library(RSkittleBrewer) library(genefilter) library(dplyr) library(devtools) pheno_data_file <- "pheno_list.txt" pheno_data <- read.table(pheno_data_file, header=TRUE, colClasses = rep("character", 2)) pheno_data = pheno_data[order(pheno_data$ids),] bg <- ballgown(dataDir = "ballgown", samplePattern="WTBMCold", pData=pheno_data) bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE) results_transcripts = stattest(bg_filt, feature="transcript",covariate="timepoint", getFC=TRUE, meas="FPKM") results_genes = stattest(bg_filt, feature="gene", covariate="timepoint", getFC=TRUE, meas="FPKM") results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_filt), geneIDs=ballgown::geneIDs(bg_filt), results_transcripts) results_transcripts = arrange(results_transcripts,qval) results_genes = arrange(results_genes,qval) write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE) write.csv(results_genes, "gene_results.csv", row.names=FALSE) subset(results_transcripts,results_transcripts$qval<0.05) subset(results_genes,results_genes$qval<0.05)
How many replicates do you have? I think ballgown is recommended for more than 5 replicates, why not try deseq2?
Ok I will give it a try. Thanks buffo.
I have edited your post to make it more readable. Numbered lists are created by using a number, then a period, then a space. You were missing the space. Short sections of code can be formatted with back-ticks, and longer sections by starting lines with four spaces (or 8 in a numbered list) or using the code format button (the one labelled 101010).
I wouldn't worry about the mapQ numbers, I each aligner sets these differently and I think that HiSat only ever uses those three values.
Thanks a lot. I have removed the —dta parameters and try it once.Hope I can get desirable results.