Hi together,
I’m new in transcriptomics and need your help concerning htseq-counting. I have single-end data from Novaseq600. The alignment was performed with STAR, version 2.6.0c to a reference genome from the dog from ncbi. The percentage of uniquely-mapped reads was adequate and when watching at the bam files, the reads lie on the exon region.
For the counting with htseq (v. 0.9.1) I used the following command:
file=$(ls *.bam | sed -n ${SLURM_ARRAY_TASK_ID}p)
htseq-count -f bam -i gene_name $file /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/ref_CanFam3.1_Nochr.modi.gtf >$file.count
The genome data files were all modified in this way, that they begin with chr . The lines of the GTF-file look like this:
chr1 Gnomon exon 251990 252562 . - . transcript_id "rna0"; gene_id "gene0"; gene_name "ENPP1";
My problem is the high number of no_feature after the htseq. 10 Mio makes around 95% of my mapped reads…:
__no_feature 10214983
__ambiguous 95
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 4758304
Because the bam-files look good, the problem seems not to be a DNA contamination with introns. Just as a try, I also changed the –s parameter and the –p parameter several times, but as expected in single end read data that did not change a lot.
What do you think? Where could lie the problem?
Thanks in advance!
What does
samtools view -H your.bam
show?Do you mean the following parameters:
It's just an example of one bam file...
As you can see your alignment contains chromosome names in
chr.N
format where as your GTF file has them inchrN
format. So if you edit your GTF file to match the names things should start working. Try this.Use
new.gtf
withhtseq-count
.Ou, I'm sorry... That was one of the elder .bam files... Here is the newer one
We had it first with chr.N and changed it than to chrN just to be sure...
But thanks for your answer... Do you have another idea?
Are you using the correct GTF/GFF file that was obtained from the same source where your reference sequence came from?
Unfortunately I get the modified genome-files from people who have more experience in transcriptomics than I have (I began 2 weeks ago with bioinformatics :P)... They told be to have taken the file from ncbi. And they also do not know an advice now... You suppose the genome-files to be the problem, right?
I was thinking about load them again newly from ncbi and modify them as little as possible. But I'm not sure, which file I should take. Especially I can not find one good fasta-file... It should come from ncbi because that's the best source for the dog.
CanFam GFF file from NCBI can be obtained here and it does not have
chr
designations so I am not sure where you got the one you are using.If you are going to do this again then you can find the genome file and the annotation on the dog genome page at NCBI.
It's probably from UCSC. At least the one from Ensembl has the normal
1
,2
, etc. chromosome names.I'm sorry for my late comment, on Friday I could only comment 5 times because I'm new on the forum. Thank you for your comments... We took the fasta-file from NCBI and changed it, because it's starting line begins with NC_00 or something like that. We changed it unfortunately to chrN. Then we had to change the gff and the gtf from NCBI also from N to chrN. That's why they have all chrN names. We changed all the files...
When I used -s no the no_features were around 9.5Mio but the ambigous get higher aroung 1500...
Look at the BAM file in IGV, I suspect you have a mismatch between the annotation and the genome used.
We looked at the bam files, to see if the reads are on the genes. How can I see there, if genome and annotation are matching?
If they aren't then the reads will generally not be on the genes. Alternatively, look at the BAM header (
samtools view -H
) and see if the chromosome lengths match whatever genome version you're expecting.