Hisat2 with HTSeq-Count
0
1
Entering edit mode
6.2 years ago
dacacis ▴ 10

Hello everyone!

I am working with Tophat2 and HTSeq-count until now but, I want to use HISAT2 and learn about it. For that, I have been trying and working with this tool this week. Everything was fine until I tried to calculate the count from the SAM files using HTSeq-count. When I used this tool for the count, all the transcript counts in the output files are 0. This represents a real problem for me because I need to use HTSeq-count to keep the same distribution than the data that I already have.

So, Is it possible to use these two tools together? or Is there another tool that mantain the same distribution in the output file that HTSeq-count but works with HISAT2?

Both HISAT2 and HTSeq-count are in their last version.

Thank you in advance for your time.

sequencing alignment count RNA-Seq genome • 6.0k views
ADD COMMENT
1
Entering edit mode

The first thing that you should check is if the chromosome identifiers in your fasta genome used for creating the index are the same as in your gtf used for counting. Most common mistake.

ADD REPLY
0
Entering edit mode

I am using the release 89 of the human genome for both the fasta and the gtf files. The FASTA file is the Homo_sapiens.GRCh38.cdna and the GFT is the Homo_sapiens.GRCh38.89

ADD REPLY
1
Entering edit mode

You aligned to the cdna? That's not right, you should align to the genome. The GTF doesn't correspond to the cdna fasta, only to the genome fasta.

ADD REPLY
0
Entering edit mode

Ok, I though that the CDNA is used with tools like HISAT2, SALMON, KALLISTO,etc... So I'll go to try with the whole genome such as in TOPHAT2. I hope this fix my problem. Thank you!

ADD REPLY
1
Entering edit mode

The CDS FASTA transcriptome is indeed used for Kallisto and Salmon, but not HISAT2

ADD REPLY
0
Entering edit mode

I just aligned the sample with the whole genome, but the problem persist. I have zeros in all the counts of each transcript. This is the message that I obtain when I create the SAM file before the count file:

Converting SRR1663209 sample to SAM file...
14605626 reads; of these:
  14605626 (100.00%) were paired; of these:
    10936288 (74.88%) aligned concordantly 0 times
    1176660 (8.06%) aligned concordantly exactly 1 time
    2492678 (17.07%) aligned concordantly >1 times
    ----
    10936288 pairs aligned concordantly 0 times; of these:
      24476 (0.22%) aligned discordantly 1 time
    ----
    10911812 pairs aligned 0 times concordantly or discordantly; of these:
      21823624 mates make up the pairs; of these:
        20311900 (93.07%) aligned 0 times
        786761 (3.61%) aligned exactly 1 time
        724963 (3.32%) aligned >1 times
30.47% overall alignment rate

And this is the output of htseq-count:

Warning: 215252 reads with missing mate encountered.
22742460 SAM alignment pairs processed.

I am using the Homo_sapiens.GRCh38.90 for both the FASTA and GTF files. Is it possible that the version of the genome should be the problem?

ADD REPLY
1
Entering edit mode

Why do you need to use HISAT2? - to identify novel transcripts?

What I would do is the following:

  1. align with HISAT2
  2. Create novel transcriptome GTF with StringTie (guided by the existing known transcriptome GTF)
  3. Commence the HT-seq pipeline from the very beginning (FASTQ files) but use your new GTF
ADD REPLY
0
Entering edit mode

I am working integrating microarray gene expression with RNA-seq gene expression. For that, I use tophat with htseq-count for the read counts. The problem is that tophat2 is very slowly, for that reason I want to accelerate the pipeline with Hisat2.

HTSeq-count is the only tool that maintain the gene expression values in the same range and distribution than microarray, at least as far as I know. For that reason I need solve this problem.

Thank you again for your supporting!

ADD REPLY
0
Entering edit mode

...but, TopHat2 / HISAT2 are mainly designed for de novo transcript identification and de novo splice isoform identification. They can also be used for standard RNA-seq differential expression analysis, of course.

HTseq does not produce counts on the same distribution as microarray. HTseq produces expression values that follow a negative binomial distributon, whilst microarray expression values are binomially-distributed log (base 2) values.

In my opinion, it is not a good idea to attempt to merge RNA-seq data with microarray data - sorry / lo siento. It would be better to use one dataset as the training and the other as a validation.

Kevin

ADD REPLY
1
Entering edit mode

A large fraction of your reads do not appear to align. You should investigate that first (grab a sample of those reads and blast at NCBI to see if you have a problem of contamination of some sort) before worrying about 0 counts.

ADD REPLY

Login before adding your answer.

Traffic: 2498 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6