Htseq: High number of no_feature
0
0
Entering edit mode
5.7 years ago
nadja.g • 0

Hi together,

I’m new in transcriptomics and need your help concerning htseq-counting. I have single-end data from Novaseq600. The alignment was performed with STAR, version 2.6.0c to a reference genome from the dog from ncbi. The percentage of uniquely-mapped reads was adequate and when watching at the bam files, the reads lie on the exon region.

For the counting with htseq (v. 0.9.1) I used the following command:

file=$(ls *.bam | sed -n ${SLURM_ARRAY_TASK_ID}p)

htseq-count -f bam -i gene_name $file /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/ref_CanFam3.1_Nochr.modi.gtf >$file.count

The genome data files were all modified in this way, that they begin with chr . The lines of the GTF-file look like this:

chr1    Gnomon  exon    251990  252562  .       -       .       transcript_id "rna0"; gene_id "gene0"; gene_name "ENPP1";

My problem is the high number of no_feature after the htseq. 10 Mio makes around 95% of my mapped reads…:

__no_feature    10214983

__ambiguous     95

__too_low_aQual 0

__not_aligned   0

__alignment_not_unique  4758304

Because the bam-files look good, the problem seems not to be a DNA contamination with introns. Just as a try, I also changed the –s parameter and the –p parameter several times, but as expected in single end read data that did not change a lot.

What do you think? Where could lie the problem?

Thanks in advance!

RNA-Seq htseq-count • 3.1k views
ADD COMMENT
0
Entering edit mode

What does samtools view -H your.bam show?

ADD REPLY
0
Entering edit mode

Do you mean the following parameters:

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr.10       LN:69331447
@SQ     SN:chr.11       LN:74389097
@SQ     SN:chr.12       LN:72498081
@SQ     SN:chr.13       LN:63241923
@SQ     SN:chr.14       LN:60966679
@SQ     SN:chr.15       LN:64190966
@SQ     SN:chr.16       LN:59632846
@SQ     SN:chr.17       LN:64289059
@SQ     SN:chr.18       LN:55844845
@SQ     SN:chr.19       LN:53741614
@SQ     SN:chr.1        LN:122678785
@SQ     SN:chr.20       LN:58134056
@SQ     SN:chr.21       LN:50858623
@SQ     SN:chr.22       LN:61439934
@SQ     SN:chr.23       LN:52294480
@SQ     SN:chr.24       LN:47698779
@SQ     SN:chr.25       LN:51628933
@SQ     SN:chr.26       LN:38964690
@SQ     SN:chr.27       LN:45876710
@SQ     SN:chr.28       LN:41182112
@SQ     SN:chr.29       LN:41845238
@SQ     SN:chr.2        LN:85426708
@SQ     SN:chr.30       LN:40214260
@SQ     SN:chr.31       LN:39895921
@SQ     SN:chr.32       LN:38810281
@SQ     SN:chr.33       LN:31377067
@SQ     SN:chr.34       LN:42124431
@SQ     SN:chr.35       LN:26524999
@SQ     SN:chr.36       LN:30810995
@SQ     SN:chr.37       LN:30902991
@SQ     SN:chr.38       LN:23914537
@SQ     SN:chr.3        LN:91889043
@SQ     SN:chr.4        LN:88276631
@SQ     SN:chr.5        LN:88915250
@SQ     SN:chr.6        LN:77573801
@SQ     SN:chr.7        LN:80974532
@SQ     SN:chr.8        LN:74330416
@SQ     SN:chr.9        LN:61074082
@SQ     SN:chr.MT       LN:16727
@SQ     SN:chr.X        LN:123869142
@PG     ID:STAR PN:STAR VN:STAR_2.6.0c  CL:STAR   --runThreadN 8   --genomeDir /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/genome_index   --readFilesIn F9_L2_R1_001_L4WxcGBKBUlB.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix ./F9_L2_R1_001_L4WxcGBKBUlB   --outSAMtype BAM   SortedByCoordinate      --outFilterType BySJout   --outFilterMultimapNmax 50   --outFilterMismatchNmax 2   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000   --alignMatesGapMax 1000000   --alignSJoverhangMin 1
@CO     user command line: STAR --runThreadN 8 --genomeDir /home/ubelix/vetsuisse/ng18q090/CanFam3.1-starIndex/genome_index --readFilesIn F9_L2_R1_001_L4WxcGBKBUlB.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 50 --alignSJoverhangMin 1 --outFilterMismatchNmax 2 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFileNamePrefix ./F9_L2_R1_001_L4WxcGBKBUlB --outSAMtype BAM SortedByCoordinate

It's just an example of one bam file...

ADD REPLY
1
Entering edit mode

As you can see your alignment contains chromosome names in chr.N format where as your GTF file has them in chrN format. So if you edit your GTF file to match the names things should start working. Try this.

sed 's/chr/chr\./'  your.gtf > new.gtf

Use new.gtf with htseq-count.

ADD REPLY
0
Entering edit mode

Ou, I'm sorry... That was one of the elder .bam files... Here is the newer one

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr10        LN:69331447
@SQ     SN:chr11        LN:74389097
@SQ     SN:chr12        LN:72498081
@SQ     SN:chr13        LN:63241923
@SQ     SN:chr14        LN:60966679
@SQ     SN:chr15        LN:64190966
@SQ     SN:chr16        LN:59632846
@SQ     SN:chr17        LN:64289059
@SQ     SN:chr18        LN:55844845
@SQ     SN:chr19        LN:53741614
@SQ     SN:chr1 LN:122678785
@SQ     SN:chr20        LN:58134056
@SQ     SN:chr21        LN:50858623
@SQ     SN:chr22        LN:61439934
@SQ     SN:chr23        LN:52294480
@SQ     SN:chr24        LN:47698779
@SQ     SN:chr25        LN:51628933
@SQ     SN:chr26        LN:38964690
@SQ     SN:chr27        LN:45876710
@SQ     SN:chr28        LN:41182112
@SQ     SN:chr29        LN:41845238
@SQ     SN:chr2 LN:85426708
@SQ     SN:chr30        LN:40214260
@SQ     SN:chr31        LN:39895921
@SQ     SN:chr32        LN:38810281
@SQ     SN:chr33        LN:31377067
@SQ     SN:chr34        LN:42124431
@SQ     SN:chr35        LN:26524999
@SQ     SN:chr36        LN:30810995
@SQ     SN:chr37        LN:30902991
@SQ     SN:chr38        LN:23914537
@SQ     SN:chr3 LN:91889043
@SQ     SN:chr4 LN:88276631
@SQ     SN:chr5 LN:88915250
@SQ     SN:chr6 LN:77573801
@SQ     SN:chr7 LN:80974532
@SQ     SN:chr8 LN:74330416
@SQ     SN:chr9 LN:61074082
@SQ     SN:chrMT        LN:16727
@SQ     SN:chrX LN:123869142
@PG     ID:STAR PN:STAR VN:STAR_2.6.0c  CL:STAR   --runThreadN 8   --genomeDir /home/ubelix/vetsuisse/ng18q090/GeneDiffAna/genome_index   --readFilesIn F9_L3_R1_001_HllJvKZkHwYR.fastq.gz      --readFilesCommand zcat      --outFileNamePrefix ./F9_L3_R1_001_HllJvKZkHwYR   --outSAMtype BAM   SortedByCoordinate      --outFilterType BySJout   --outFilterMultimapNmax 50   --outFilterMismatchNmax 2   --outFilterMismatchNoverLmax 0.04   --alignIntronMin 20   --alignIntronMax 1000000   --alignMatesGapMax 1000000   --alignSJoverhangMin 1
@CO     user command line: STAR --runThreadN 8 --genomeDir /home/ubelix/vetsuisse/ng18q090/GeneDiffAna/genome_index --readFilesIn F9_L3_R1_001_HllJvKZkHwYR.fastq.gz --readFilesCommand zcat --outFilterType BySJout --outFilterMultimapNmax 50 --alignSJoverhangMin 1 --outFilterMismatchNmax 2 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outFileNamePrefix ./F9_L3_R1_001_HllJvKZkHwYR --outSAMtype BAM SortedByCoordinate

We had it first with chr.N and changed it than to chrN just to be sure...

But thanks for your answer... Do you have another idea?

ADD REPLY
0
Entering edit mode

Are you using the correct GTF/GFF file that was obtained from the same source where your reference sequence came from?

ADD REPLY
0
Entering edit mode

Unfortunately I get the modified genome-files from people who have more experience in transcriptomics than I have (I began 2 weeks ago with bioinformatics :P)... They told be to have taken the file from ncbi. And they also do not know an advice now... You suppose the genome-files to be the problem, right?

I was thinking about load them again newly from ncbi and modify them as little as possible. But I'm not sure, which file I should take. Especially I can not find one good fasta-file... It should come from ncbi because that's the best source for the dog.

ADD REPLY
0
Entering edit mode

CanFam GFF file from NCBI can be obtained here and it does not have chr designations so I am not sure where you got the one you are using.

If you are going to do this again then you can find the genome file and the annotation on the dog genome page at NCBI.

ADD REPLY
0
Entering edit mode

It's probably from UCSC. At least the one from Ensembl has the normal 1, 2, etc. chromosome names.

ADD REPLY
0
Entering edit mode

I'm sorry for my late comment, on Friday I could only comment 5 times because I'm new on the forum. Thank you for your comments... We took the fasta-file from NCBI and changed it, because it's starting line begins with NC_00 or something like that. We changed it unfortunately to chrN. Then we had to change the gff and the gtf from NCBI also from N to chrN. That's why they have all chrN names. We changed all the files...

ADD REPLY
0
Entering edit mode

When I used -s no the no_features were around 9.5Mio but the ambigous get higher aroung 1500...

ADD REPLY
0
Entering edit mode

Look at the BAM file in IGV, I suspect you have a mismatch between the annotation and the genome used.

ADD REPLY
0
Entering edit mode

We looked at the bam files, to see if the reads are on the genes. How can I see there, if genome and annotation are matching?

ADD REPLY
0
Entering edit mode

If they aren't then the reads will generally not be on the genes. Alternatively, look at the BAM header (samtools view -H) and see if the chromosome lengths match whatever genome version you're expecting.

ADD REPLY

Login before adding your answer.

Traffic: 2616 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6