Help with Bowtie2
8.3 years ago

Hi All,

I have a fairly large dataset of 23-nt DNA sequences. Can I use Bowtie2 to find the sequences in this dataset that differ from a given query by up to 4 mismatches? The query will also be a 23-nt sequence. I am new to Bowtie. Kindly help! Thanks in advance.


Hi h.mon,

Could you please tell me whether the alignment between the query and the hits can be viewed when using cd-hit-est-2d? With the current output format it is difficult to work out the positions of mismatches and gaps.

Thanks in advance!!

8.3 years ago

Four mismatches at Bowtie2's default maximum mismatch penalty of 6 each correspond to a minimum score of around -24, so --very-sensitive --score-min C,-24,0 -N 1 or something like that should work. You'll probably need to decrease the seed length to something like -L 15 and then play with some of the other settings. Are these miRNAs or do you really just have extremely short reads?
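The -24 figure can be reproduced with a little arithmetic. Below is a minimal Python sketch that derives the --score-min constant from the number of allowed mismatches and assembles the options above into an argument list without running anything. The index name, the file names, and the -a (report all alignments) and -f (FASTA input) flags are assumptions added for illustration; they are not part of the suggestion above.

```python
# Sketch: derive the --score-min constant and build a bowtie2 argument list.
MAX_MISMATCH_PENALTY = 6  # Bowtie2 default MX in --mp 6,2


def min_score(max_mismatches, penalty=MAX_MISMATCH_PENALTY):
    """Lowest end-to-end alignment score that still allows max_mismatches."""
    return -max_mismatches * penalty


def bowtie2_args(index, reads, out_sam, max_mismatches=4, seed_len=15):
    """Return the command as a list (hypothetical file names; nothing is run)."""
    return [
        "bowtie2", "--very-sensitive",
        "--score-min", f"C,{min_score(max_mismatches)},0",
        "-N", "1", "-L", str(seed_len),
        "-a", "-f",  # assumed: report all alignments, reads are FASTA
        "-x", index, "-U", reads, "-S", out_sam,
    ]


# Hypothetical index and file names, for illustration only.
print(" ".join(bowtie2_args("db_index", "reads.fa", "hits.sam")))
```

The constant form C,-24,0 keeps the threshold independent of read length, which is fine here because every sequence is 23 nt.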


Thanks for the reply. These are probable CRISPR binding sites.


Bowtie1 or bwa-aln will probably work better with reads this short. With 15 bp seeds, Bowtie2 will miss many 2-mismatch hits, let alone 4-mismatch ones.
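To see why seeds are the bottleneck, here is a rough Python sketch that lists which 15-nt windows of a 23-nt read could still serve as a seed for a given set of mismatch positions. It is a simplification: it considers every possible window, whereas Bowtie2 only extracts seeds at intervals, and the function name and example positions are made up for illustration.

```python
def usable_seed_starts(read_len, seed_len, mismatch_positions, allowed=0):
    """0-based start offsets of windows with at most `allowed` mismatches."""
    starts = []
    for start in range(read_len - seed_len + 1):
        # Count mismatches that fall inside the window [start, start + seed_len).
        hits = sum(start <= pos < start + seed_len for pos in mismatch_positions)
        if hits <= allowed:
            starts.append(start)
    return starts


# Two mismatches in the middle of the read fall inside every 15-nt window,
# so no window passes even when one seed mismatch is tolerated (as with -N 1).
print(usable_seed_starts(23, 15, [9, 12], allowed=1))   # []

# Two mismatches at the very ends leave every window usable with -N 1.
print(usable_seed_starts(23, 15, [0, 22], allowed=1))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Positions 8 to 14 (0-based) belong to every 15-nt window of a 23-nt read, so any two mismatches there defeat seeding under these assumptions.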


But Bowtie 1 only allows up to 3 mismatches...
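Since all sequences have the same length and only mismatches matter, a direct Hamming-distance scan has no mismatch cap. This is a different technique from any aligner mentioned in this thread: a plain Python sketch that ignores indels and reverse complements and will be slow on millions of sequences. The function names and the example sequences are made up for illustration.

```python
def mismatch_positions(a, b):
    """0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def find_hits(query, sequences, max_mismatches=4):
    """Yield (sequence, positions) for sequences within max_mismatches of query."""
    for seq in sequences:
        positions = mismatch_positions(query, seq)
        if len(positions) <= max_mismatches:
            yield seq, positions


# Made-up 23-nt example data.
query = "ACGTACGTACGTACGTACGTACG"
dataset = [
    "TCGTACGTACGTACGTACGTACA",  # differs from the query at both ends
    "T" * 23,                   # far more than 4 mismatches
]
for seq, positions in find_hits(query, dataset):
    print(seq, positions)
```

Because the mismatch positions are returned with each hit, this also shows where a hit differs from the query.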

8.3 years ago
h.mon 35k

For what you want, you may use cd-hit-est-2d:

cd-hit-est-2d -i 1.fa -i2 2.fa -o out -c 0.8 -n 4

You may have to play around with the word size parameter (-n), but I think 4 should work.

I'm almost sure I saw a question similar to yours a few days ago, but I can't find it. In that question, SWARM was mentioned and the OP seemed happy with the results.


Hi,

I have my own dataset with millions of DNA sequences (each 23 nt long), within which I would like to find sequences similar to a new query. Can this be used for that? Can you please explain to me the various parameters used here?

Thanks!

  -i    input filename for db1 in fasta format, required
  -i2    input filename for db2 in fasta format, required
  -o    output filename, required
  -c    sequence identity threshold, default 0.9
  -n    word_length, default 10, see user's guide for choosing it

You have to use a small word size, as you want somewhat low similarity and have very short sequences. Set your identity threshold according to the level of similarity you want (for up to 4 mismatches in 23 nt, that is 19/23 ≈ 0.826).
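The threshold arithmetic can be sketched in Python as follows, assuming identity is counted as matching bases divided by sequence length and that the value is rounded down so that sequences with exactly the allowed number of mismatches still pass. The function name is made up for illustration.

```python
import math


def identity_threshold(length, max_mismatches, decimals=2):
    """Largest -c value, rounded down, that still admits max_mismatches."""
    identity = (length - max_mismatches) / length
    factor = 10 ** decimals
    return math.floor(identity * factor) / factor


print(identity_threshold(23, 4))  # 0.82
```

Rounding up to 0.83 would exclude 4-mismatch hits, since 19/23 is slightly below that value.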

