I want to be able to count the number of occurrences of a given sequence (for example ACTTTAG) in the GRCh38 reference genome. Is there an existing tool for doing this? Thanks!
You can use the Biostrings package in Bioconductor.
The function countPattern should do the job. Just check whether it counts overlapping matches, i.e. whether it slides the pattern along the sequence one base at a time.
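To see why the overlap question matters, here is a minimal Python sketch (not Biostrings itself; the function names are made up for illustration) comparing a non-overlapping count with a sliding, overlapping one:

```python
import re

def count_nonoverlapping(seq, pattern):
    # str.count skips past each match, so overlapping hits are missed.
    return seq.count(pattern)

def count_overlapping(seq, pattern):
    # A zero-width lookahead matches at every start position,
    # so hits that share bases are all counted.
    return len(re.findall(f"(?={re.escape(pattern)})", seq))

print(count_nonoverlapping("AAAA", "AA"))  # 2
print(count_overlapping("AAAA", "AA"))     # 3
```

For a motif like ACTTTAG, which cannot overlap itself, both counts agree; the difference only shows up for self-overlapping patterns such as AA or ATAT.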
Jellyfish is pretty nice for kmer counting.
I wrote a simple script for finding patterns (regular expressions, in fact) in FASTA files: fastaRegexFinder.py. I also mention it in this post: Quadruplex sequence batch prediction.
If you just want to count the number of occurrences, you can do:
fastaRegexFinder.py -f genome.fa -r 'ACTTTAG' | wc -l
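If you would rather not depend on an external script, the same idea can be sketched in plain Python. This is only an illustration, not the code of fastaRegexFinder.py: the function names and the genome.fa path are assumptions, it loads one sequence at a time into memory, and it counts hits on both strands (whether a given tool also searches the reverse strand is worth checking).

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    # Complement each base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]

def read_fasta(lines):
    # Yield (name, sequence) pairs from any iterable of FASTA lines.
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def count_both_strands(seq, pattern):
    # Case-insensitive, overlapping count on forward and reverse strands.
    seq, pattern = seq.upper(), pattern.upper()
    fwd = len(re.findall(f"(?={re.escape(pattern)})", seq))
    rc = reverse_complement(pattern)
    if rc == pattern:
        # Palindromic motif: forward and reverse hits are the same sites.
        return fwd
    return fwd + len(re.findall(f"(?={re.escape(rc)})", seq))

def count_in_file(path, pattern):
    # Total count over every sequence in a FASTA file.
    with open(path) as handle:
        return sum(count_both_strands(seq, pattern)
                   for _, seq in read_fasta(handle))
```

Calling count_in_file("genome.fa", "ACTTTAG") would then return the total over all chromosomes. Upper-casing the sequence means soft-masked (lowercase) regions are counted too.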
You can also use bowtie1.
It is especially good at finding (mapping) short sequences.