Question

Help with Bowtie2

8.3 years ago

jaspreetk.dhanjal ▴ 20

Hi All,

I have a fairly large dataset of 23 nt DNA sequences. Can I use Bowtie2 to find the sequences in this dataset that differ from a given query by up to 4 mismatches? The query is also a 23 nt sequence. I am new to Bowtie. Kindly help! Thanks in advance.

alignment Bowtie2 • 2.1k views
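
To make the search criterion concrete, here is a minimal brute-force sketch of "differs from the query by up to 4 mismatches" for equal-length sequences. It is not what Bowtie2 does internally; the function names and the toy sequences are made up for illustration, and it assumes substitutions only (no gaps) on the given strand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_near_matches(query, sequences, max_mismatches=4):
    """Yield (sequence, mismatches) for each sequence within max_mismatches of query."""
    query = query.upper()
    for seq in sequences:
        distance = hamming(query, seq.upper())
        if distance <= max_mismatches:
            yield seq, distance


# Toy 23 nt data (hypothetical sequences, not real CRISPR sites)
query = "ACGTACGTACGTACGTACGTACG"
dataset = [
    "ACGTACGTACGTACGTACGTACG",  # identical to the query
    "ACGTTCGTACGAACGTACGTACC",  # differs at three positions
    "T" * 23,                   # differs at most positions
]

hits = list(find_near_matches(query, dataset))
for seq, distance in hits:
    print(seq, distance)
```

A linear scan like this is slow for very large datasets, but it gives a reference result against which the output of an aligner or clustering tool can be checked.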

7.9 years ago by jaspreetk.dhanjal ▴ 20

Hi h.mon,

Can you please tell me whether we can view the alignment between the query and the hits obtained with cd-hit-est-2d? It is difficult to guess the positions of mismatches and gaps from the current output format.

Thanks in advance!!

7.9 years ago by jaspreetk.dhanjal ▴ 20

score 1 · Answer 1 · 2016-01-09

8.3 years ago

Devon Ryan 104k

Four mismatches correspond to a minimum alignment score of around -24 (the default maximum mismatch penalty is 6 per mismatch), so something like --very-sensitive --score-min C,-24,0 -N 1 should work. You'll probably need to decrease the seed length to something like -L 15 and then play with some of the other settings. Are these miRNAs, or do you really just have extremely short reads?

8.3 years ago by Devon Ryan 104k
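
The arithmetic behind the -24 threshold can be sketched as follows. This is only an illustration: it assumes Bowtie2's default end-to-end mismatch penalties (--mp 6,2) and high-quality bases, so that each mismatch costs the maximum of 6. The helper names are hypothetical.

```python
# Assumption: default --mp 6,2, with every mismatch charged the maximum penalty of 6.
DEFAULT_MAX_MISMATCH_PENALTY = 6


def constant_score_min(max_mismatches, mismatch_penalty=DEFAULT_MAX_MISMATCH_PENALTY):
    """Build a constant --score-min value that still accepts max_mismatches mismatches."""
    threshold = -max_mismatches * mismatch_penalty
    return f"C,{threshold},0"


def bowtie2_options(max_mismatches=4, seed_length=15):
    """Assemble the option list suggested in the answer above."""
    return [
        "--very-sensitive",
        "--score-min", constant_score_min(max_mismatches),
        "-N", "1",
        "-L", str(seed_length),
    ]


print(" ".join(bowtie2_options()))
```

Note that a score budget of -24 also admits alignments containing gaps whose total penalty fits within it, so if only substitutions are wanted, the reported hits may need to be filtered afterwards.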

Thanks for the reply. These are probable CRISPR binding sites.

8.3 years ago by jaspreetk.dhanjal ▴ 20

bowtie1 or bwa-aln will probably work better with such short reads. With 15 bp seeds, we will miss many 2-mismatch hits, let alone 4-mismatch ones.

7.9 years ago by lh3 33k
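
The point about 15 bp seeds can be illustrated with a small enumeration. The sketch below uses an idealized model, not Bowtie2's actual seed extraction: it assumes a seed may start at every offset of a 23 nt read on one strand, and asks for which mismatch placements at least one 15-mer window contains no more than the allowed number of seed mismatches. The function name and parameters are hypothetical.

```python
from itertools import combinations


def fraction_seedable(read_len, seed_len, n_mismatches, seed_mismatches=0):
    """Fraction of mismatch placements that leave at least one usable seed window.

    A window is usable if it contains at most seed_mismatches mismatches.
    """
    starts = range(read_len - seed_len + 1)
    total = 0
    seedable = 0
    for positions in combinations(range(read_len), n_mismatches):
        total += 1
        if any(
            sum(start <= p < start + seed_len for p in positions) <= seed_mismatches
            for start in starts
        ):
            seedable += 1
    return seedable / total


for n in (1, 2, 3, 4):
    exact = fraction_seedable(23, 15, n, seed_mismatches=0)
    one_mm = fraction_seedable(23, 15, n, seed_mismatches=1)
    print(f"{n} mismatches: exact seeds {exact:.3f}, 1-mismatch seeds {one_mm:.3f}")
```

In this model, positions 8 to 14 of the read (0-based) fall inside every 15-mer window, so a mismatch there rules out all exact seeds. Because Bowtie2 only samples seeds at intervals, the real sensitivity may be lower than these fractions suggest.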

But Bowtie 1 can only search with up to 3 mismatches...

7.9 years ago by jaspreetk.dhanjal ▴ 20

Ram · Answer 2 · 2016-01-09

8.3 years ago

h.mon 35k

For what you want, you may use cdhit-est-2d:

cdhit-est-2d -i 1.fa -i2 2.fa -o out -c 0.8 -n 4

You may have to play around with the word size parameter (-n), but I think 4 should work.

I'm almost sure I saw a question similar to yours a few days ago, but I can't find it. In that question, SWARM was mentioned and the OP seemed happy with the results.

updated 4.3 years ago by Ram 43k • written 8.3 years ago by h.mon 35k

Hi,

I have my own dataset with millions of DNA sequences (each 23 nt long), within which I would like to find sequences similar to a new query. Can this tool be used for that? Can you please explain the various parameters used here?

Thanks!

updated 4.3 years ago by Ram 43k • written 8.3 years ago by jaspreetk.dhanjal ▴ 20

  -i     input filename for db1 in fasta format, required
  -i2    input filename for db2 in fasta format, required
  -o     output filename, required
  -c     sequence identity threshold, default 0.9
  -n     word_length, default 10, see user's guide for choosing it

You have to use a small word size, because you want fairly low similarity and have very short sequences. Set your identity threshold according to the level of similarity you want; for up to 4 mismatches in 23 nt that is 19/23, or about 0.83.

updated 4.3 years ago by Ram 43k • written 8.3 years ago by h.mon 35k
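
The relation between the identity threshold and the number of tolerated mismatches can be sketched like this. It assumes, for simplicity, that identity is the number of matching bases divided by the full sequence length with no gaps; how CD-HIT actually computes identity depends on its alignment options. The helper names are hypothetical.

```python
import math


def min_identity(length, max_mismatches):
    """Identity of an ungapped alignment with max_mismatches substitutions."""
    return (length - max_mismatches) / length


def mismatches_allowed(length, identity):
    """Largest number of substitutions that still meets the identity threshold."""
    # Small epsilon guards against floating-point noise just above an integer.
    min_matches = math.ceil(identity * length - 1e-9)
    return length - min_matches


print(f"identity for 4 mismatches in 23 nt: {min_identity(23, 4):.3f}")
print(f"mismatches allowed at -c 0.8: {mismatches_allowed(23, 0.8)}")
print(f"mismatches allowed at -c 0.9: {mismatches_allowed(23, 0.9)}")
```

Under this simplified model, the -c 0.8 used in the command above and a threshold of 19/23 both tolerate up to 4 mismatches in a 23 nt sequence.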