Entering edit mode
5.1 years ago
guotingfeng
▴
10
Hi dear all, I used HTseq to count the number of reads that mapped into each genes through Bwa-mem alignment, but I found no matter what parameter I changed, the output of HTseq is 100% no_feature. Could you help me figure out my mistake please?
1. I used Bwa-mem to align my reads to Bacillus 168 genome fasta format from NCBI.
module load BWA/0.7.17-GCCcore-6.3.0
module load SAMtools/1.7-GCCcore-6.3.0
pe1_1="/scratch/user/guotingfeng/RNAseq/2Mg_S9_L002_R1_001.fastq.gz"
pe1_2="/scratch/user/guotingfeng/RNAseq/2Mg_S9_L002_R2_001.fastq.gz"
read_group_id='2Mggroup'
library='pe'
sample='2Mg'
ref_genome="/scratch/user/guotingfeng/RNAseq/genomereference/twaindex/sequence.fasta"
output_bam="/scratch/user/guotingfeng//RNAseq/Bwa-mem/${sample}.sam"
if [ ! -f ${ref_genome}.bwt ]; then
bwa index $ref_genome
fi
platform='ILLUMINA'
**bwa \
mem \
-M \
-t $threads \
-R "@RG\tID:$readgroup\tLB:$library\tSM:$sample\tPL:$platform" $ref_genome $pe1_1 $pe1_2 | samtools view -h -Sb | \
samtools sort -m 2G -@ $threads -T $sample/sorted -n -o $output_bam**
I think output is fine for this process:
@HD VN:1.5 SO:queryname
@SQ SN:NC_000964.3 LN:4215606
@RG ID: LB:pe SM:1Mg PL:ILLUMINA
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 20 -R @RG\tID:\tLB:pe\tSM:1Mg\tPL:ILLUMINA /scratch/user/guotingfeng/RNAseq/genomereference/twaindex/sequence.fasta /scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R1_001.fastq.gz /scratch/user/guotingfeng/RNAseq/1Mg_S5_L002_R2_001.fastq.gz
K00333:90:H3TMTBBXY:2:1101:1133:28780 83 NC_000964.3 476475 60 75M = 476412 -138 GAAAAACGGGCGCGGTGAGAGAGTGCCGCGCGAAGTCTGTTATAATAACAGGATGAGCGTGAAAGAAAGAGAAGT NM:i:1 MD:Z:16T58 MC:Z:75M AS:i:70 XS:i:0
K00333:90:H3TMTBBXY:2:1101:1133:28780 163 NC_000964.3 476412 60 75M = 476475 138 NATTCACCAAACATCGCCCTCCGCGCAAACCGTCCCTCACGTCATTTTCTCAGAACTTGACCCGACAACCGGGCG
NM:i:9 MD:Z:0C3A2A12A13A20A5A3A2A6 MC:Z:75M AS:i:38 XS:i:0
2. The I used Bwa-mem output to analyze read number in the each genes through HTseq count.
module load HTSeq/0.9.1-intel-2017A-Python-2.7.12
pe1_1='/scratch/user/guotingfeng/RNAseq/Bwa-mem/1.sam'
pe1_2='/scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf'
sample='/scratch/user/guotingfeng/RNAseq/HTseq/Bwa-mem/bwamem_1sam.txt'
**htseq-count -f sam -m union -s yes -r name -t exon -i gene_id --additional-attr=gene_name $pe1_1 $pe1_2 > $sample**
The output is weird: because all the reads are no_feature.
gene:BSU05210 ydeI 0
gene:BSU05220 ydeJ 0
**gene:BSU05230 ydeK 0**
gene:BSU05240 ydeL 0
**__no_feature 24000765**
__ambiguous 0
__too_low_aQual 258765
__not_aligned 33939
__alignment_not_unique 0
Do you have any ideas or suggests please? Thank you very much. Have a good day.
You already have a similar thread open: Why my Hisat2 result is 0% aligned concordantly and HTseq result is 100% not aligned.
Are you using the correct annotation file? If the file is not in right format you will not get any counts. Name of the chromosome has to match what is in your alignment file. What is the output of
head -5 /scratch/user/guotingfeng/RNAseq/genomereference/Bacillus_subtilis_subsp_subtilis_str_168.ASM904v1.37.gtf
?You only have 4 genes in your gtf?
Hi swbarnes2, no, I have lots. I just show four as an example. The number of count is same. Thanks.
Hi Genomax, Thank you for your reply. Here is my content of gtf file.
Does it look OK?
The file looks ok. But the chromosome name (
Chromosome
) in your GTF file does not match the one in your alignment file (NC_000964.3
). That is your likely problem with counting. You could replace the name in your GTF file and try counting again.Hi genomax, I have tried as you suggested. It is really improved. I can get some counts in several genes, but there are still 90% reads are no_feature. Here is what I did. 1. I change chromosome to NC_000964.3. NC_000964.3 ena exon 410 1750 . + . transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";
NC_000964.3 ena CDS 410 1750 . + 0 transcript_id "transcript:CAB11777"; gene_id "gene:BSU00010"; gene_name "dnaA";
NC_000964.3 ena exon 1939 3075 . + . transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";
NC_000964.3 ena CDS 1939 3075 . + 0 transcript_id "transcript:CAB11778"; gene_id "gene:BSU00020"; gene_name "dnaN";
NC_000964.3 ena exon 3206 3421 . + . transcript_id "transcript:CAB11779"; gene_id "gene:BSU00030"; gene_name "yaaA";
Then I run the HTseq-count as above script but change the gif file. Here is the output, I picked up several genes as example.
gene:BSU40000 yxnA 181
gene:BSU40010 yxaD 84
gene:BSU40021 yxzK 5
gene:BSU40022 yxaC 13
gene:BSU40030 yxaB 7
gene:BSU40040 glxK 44
gene:BSU40050 gntR 16
__no_feature 24555316
__ambiguous 179491
__too_low_aQual 296420
__not_aligned 48137
__alignment_not_unique 0
Could you help me please? Thank you very much.
Hi genomax, I change -s yes to -s reverse and it works now.
This is my output:
gene:BSU39790 yxcE 1864
gene:BSU39800 yxcD 502
gene:BSU39810 csbC 23383
gene:BSU39820 htpG 7631
gene:BSU39830 yxcA 610
gene:BSU39840 yxbG 2596
gene:BSU39850 yxbF 928
__no_feature 980707
__ambiguous 3984165
__too_low_aQual 296420
__not_aligned 48137
__alignment_not_unique 0
Thank you for your advice and help. I really appreciate it. You save me lots of time. Have a good time.
You are asking for trouble by using variable names with no relation to what they really are: