HTSeq errors in the command line
1
0
Entering edit mode
5.1 years ago
Morris_Chair ▴ 350

Hello everyone,

I want to count the reads using HTseq of BAM files from RNAseq aligned with TopHat. The experiment consists of 2 treated samples and 2 Untreated samples.

What is not clear to me is, Do I have to do the counting of the BAM files one by one or I can give them to HTseq altogether like I do with featureCounts?

Here is the code that I'm using

htseq-count -f bam -q -m union -s yes -t exon -i gene_id ~/RNAseq/mapping/star/Treated_1.bam Treated_2.bam Untreated_1.bam Untreated_2.bam ~/RNAseq/annotation/Homo_sapiens.GRCh37.75.gtf  > /home/RNAseq/htseq-count/countReads/

and here is what I get

-bash: /home/RNAseq/htseq-count/countReads/: Is a directory

Can you kindly tell me what is wrong ?

thank you

RNA-Seq • 2.9k views
ADD COMMENT
0
Entering edit mode

This apparently escaped my initially but I see that you mention you used tophat to do the read alignment.

That is a deprecated approach and it's advised to not use it anymore, even by the original authors of it. In stead you'd better use TopHat2 or even better HiSAT2, STAR , ...

ADD REPLY
0
Entering edit mode
5.1 years ago
A. Domingues ★ 2.7k

You are directing your output to a directory. Try this:

htseq-count -f bam -q -m union -s yes -t exon -i gene_id ~/RNAseq/mapping/star/Treated_1.bam Treated_2.bam Untreated_1.bam Untreated_2.bam ~/RNAseq/annotation/Homo_sapiens.GRCh37.75.gtf  > /home/RNAseq/htseq-count/countReads/all.counts
ADD COMMENT
0
Entering edit mode

Thank you, It's impressive how long it takes to process the files...

ADD REPLY
0
Entering edit mode

Yes, if there are lot's of reads in the bam files (and substantial GFF files) this will take a while to process, several hours perhaps, not days though.

That being said, I'm not convinced this is a valid cmdline. I'm assuming that also the three other bam files will need to be provided with their full path (unless they are in the current working dir).

I'm also not sure this will give the result you want to obtain: I think this cmdline will sum all counts over all 4 bam files and produce one single output count table and seeing that you provide treated and untreated samples you should at least keep those apart from each other. Personally I would run it 4 times in parallel , once with each input bam file (and later on sum the counts if need-be)

ADD REPLY
0
Entering edit mode

Hi lieven.sterck, the others bam files are in the same directory so I thought it was ok.

the weird thing is that I got 0 reads.

[@ws7910 htseq-count]$ head countReads/all.counts

ENSG00000000003 0 0 0 0 ENSG00000000005 0 0 0 0 ENSG00000000419 0 0 0 0 ENSG00000000457 0 0 0 0 ENSG00000000460 0 0 0 0 ENSG00000000938 0 0 0 0 ENSG00000000971 0 0 0 0 ENSG00000001036 0 0 0 0 ENSG00000001084 0 0 0 0 ENSG00000001167 0 0 0 0 why I got zero?

thank you

ADD REPLY
0
Entering edit mode

to avoid confusion: with 'same directory' you mean the same as where you run htseq-count ? of the same as the first file? In the former case you're fine indeed, in the latter not though

Without having insight in the specifics, that is most often caused by a wrong value for -i of by differences in the sequence names. Can you post a small extract of the gff file your using and an example of the sequence names used as reference?

ADD REPLY
0
Entering edit mode

Hi lieven.sterck, thank you for your answers, With the same directory I mean that all the bam files are in the same folder which is not the same where I run htseq-count

here is a small extract of the gtf used.

!genome-build GRCh37.p13
#!genome-version GRCh37
#!genome-date 2009-02
#!genome-build-accession NCBI:GCA_000001405.14
#!genebuild-last-updated 2013-09
1       pseudogene      gene    11869   14412   .       +       .       gene_id "ENSG00000223972"; gene_name ```"DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
1       processed_transcript    transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; `transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "````

It's not clear what you mean with "example of the sequence names used as reference" Anyways, looking online seems that htseq has disadvantages compared to featureCounts or Salmon, so I think I will go with one of this two.

Can I ask you if you use htseq for counting, why you choose this?

thank you

ADD REPLY
0
Entering edit mode

in that case you will need to prepend all those bam files with their full path .

Well, that the fasta files you used to build the reference DB to be used when aligning reads also has the names as in the gft file (== being 1 2 ... in your case apparently) not thus not chr1 or such.

Goh, let's say for historical reasons, though I indeed more often use FeatureCounts and/or Salmon nowadays. Keep in mind that using Salmon when mapping to a genome is not straightforward. (Mapping to transcriptome is quite easy)

ADD REPLY

Login before adding your answer.

Traffic: 2701 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6