Problems getting read number per gene after sequencing
2
0
Entering edit mode
6.7 years ago
ThePresident ▴ 180

I am dealing with Illumina genomic (not transcriptomic) paired-end sequencing. For some purpose, I'd like to have a number of reads that cover a particular feature (let's say a gene) defined in a BED or GFF file.

I have sorted SAM and BAM files.

My BED files looks like this:

RandomSeqSTD    500000  505001
RandomSeqSTD    1000000 1001001
RandomSeqSTD    1500000 1500501
RandomSeqSTD    2000000 2000251
RandomSeqSTD    2500000 2500101

My GFF file (which I made) looks like this:

##gff-version 3
##sequence-region RanSeqSTD 1 4659625
RanSeqSTD   random  CDS 500000  505000  .   .   .   gene_id=5000bp;transcript_id=same
RanSeqSTD   random  CDS 1000000 1001000 .   .   .   gene_id=1000bp;transcript_id=same
RanSeqSTD   random  CDS 1500000 1500500 .   .   .   gene_id=500bp;transcript_id=same
RanSeqSTD   random  CDS 2000000 2000250 .   .   .   gene_id=250bp;transcript_id=same
RanSeqSTD   random  CDS 2500000 2500100 .   .   .   gene_id=100bp;transcript_id=same

First I tried getting read number for each of these features with:

bedtools multicov -bams *.bam -bed BEDfile.bed

I got a read count and in theory this might be enough, however the problem is that I have limited control over what is considerate to be part of the feature. So I tried using HTSeq with the sorted SAM file and the above GFF file with the following command line:

htseq-count -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

However, I keep getting count "0" for all gene_id features.

Any idea why HTSeq gives me this and any idea how this can be done differently?

Note: I am fully aware that HTSeq was not developed for genomic sequencing and thus should not be used for this.

Thanks TP

HTSeq bedtools multicov • 2.5k views
ADD COMMENT
0
Entering edit mode

I think, gene_id and transcript_id need to be separated by a space (no '=' sign) and the actual IDs are escaped using " characters:

gene_id "ENSG00000223972"; transcript_id "ENST00000456328";

That's at least how I parse these IDs in my own tool alfred which should do the read counting you are looking for (requires a position-sorted BAM file):

/src/alfred count -i gene_id -f CDS -g RanSeqGFF.gtf.gz STD100_sorted.bam

ADD REPLY
0
Entering edit mode

What you are describing is a GTF file, I believe. It's worth trying though. I'll update the comment when I get the result.

ADD REPLY
1
Entering edit mode
6.7 years ago
glihm ▴ 660

You did not specified the sorting order of your pair-end alignments.

  -r {pos,name}, --order {pos,name}
                    'pos' or 'name'. Sorting order of <alignment_file>
                    (default: name). Paired-end sequencing data must be
                    sorted either by position or by read name, and the
                    sorting order must be specified. Ignored for single-
                    end data.

I guess that you did a default sorting like:

samtools sort -o sorted.sam input.sam

So, you probably need to add "-r" option like:

htseq-count -r pos -s no -m union -i gene_id -t CDS STD100_sorted.sam RanSeqGFF.gff3 > HTSeq.txt

If you already sorted your sam file with names, and not positions, you should check the references used in the sam file and the seqid of you GFF3 formatted file.

ADD COMMENT
1
Entering edit mode

You were right! Initially, I did sort by the name however I did a rookie mistake of not looking at my reference and seqid. In my GFF file I use "RanSeqSTD" but my sam file harbors "RandomSeqSTD". Anyway, problem solved. Thanks!

ADD REPLY
0
Entering edit mode

we all did ! :D Happy to see problem solved ! ;)

ADD REPLY
1
Entering edit mode
6.7 years ago

Just convert the BED to SAF and use featureCounts. Its pretty straightforward.

awk -v OFS="\t" 'BEGIN {"GeneID","Chr","Start","End","Strand"} { print "Gene_"NR, $1, $2, $3, "."}' in.bed > out.saf
ADD COMMENT
0
Entering edit mode

featureCounts seems very similar to bedtools multicov. My only concern with these tools is that they count a read as long as it overlaps the feature (gene, exon etc.) by 1bp which in my opinion is a little bit too loose as cutoff. Therefore, the idea to use HTSeq as it seems more stringent. However, it might be enough for what I want to do. Thanks!

ADD REPLY
1
Entering edit mode

You can always change the default parameters −−minOverlap and also if you want to make the quantification bit stringent by utilizing other arguments like −−fracOverlap , −−largestOverlap ... in my personal opinion featureCounts is more flexible by having options for many parameters and also more efficient compared to HTSeq.

ADD REPLY
0
Entering edit mode

I missed these options when I read about featureCounts. It looks good indeed, I'll give it a try. Thanks!

ADD REPLY
1
Entering edit mode

Definitely its not very similar to multiCov

ADD REPLY
0
Entering edit mode

Why do you say so? They perform identical function which is to count the number of reads that overlap a set of features defined by genomic boundaries. Additionally

featureCounts: "A read is said to overlap a feature if at least one read base is found to overlap the feature" multicovs: "-f argument : Minimum overlap required as a fraction of each A. Default is 1E-9 (i.e., 1bp)."

ADD REPLY
1
Entering edit mode

Performing the identical function, doesn't mean they both are very similar. The limitations of multiCov and the options in featureCounts make a big difference,

ADD REPLY

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6