Question

Distinguishing Between Real DNA Strings Vs Random DNA Strings

3

12.1 years ago

Zakir Naik ▴ 30

I have come up with an in silico method to differentiate between a real DNA string (originating from a real genome) and a computer-generated random DNA string. The program does not perform any alignments against sequences in existing databases and takes no more than a minute on a desktop (2.33 GHz, 2 GB RAM) to complete the job.

The only condition: the input strings should be at least 5000 characters long.

Are there any practical applications of this method?

dna genomics • 3.3k views

ADD COMMENT • link updated 3.5 years ago by Biostar 20 • written 12.1 years ago by Zakir Naik ▴ 30

1

Did you just do it for fun? Just curious.

ADD REPLY • link 12.1 years ago by Madelaine Gogol 5.3k

0

Guess it was fun to do. But did you try your method on structured English text vs randomly generated alphanumeric strings? I'm just curious too.

ADD REPLY • link 12.1 years ago by Monzoor ▴ 300

0

Yes, I did start this for fun. But the nearly 100% accurate results I got made me wonder whether the overall method has any practical value. I now have to modify the technique so that it can work with shorter strings. Though that may not be easy, the modified method may be useful for weeding out sequencing artefacts in NGS data sets.

ADD REPLY • link 12.1 years ago by Zakir Naik ▴ 30

0

Thanks, Monzoor. I will check whether I can tune it for English text vs random text. BTW, what are the applications of such a program (if it ever works for English vs random text)?

ADD REPLY • link 12.1 years ago by Zakir Naik ▴ 30

score 4 · Answer 1 · 2012-03-27

4

12.1 years ago

brentp 24k

I assume this is some sort of supervised learning?

I can think of some questions I'd try to answer (which ignore any biology):

Can you extract from your classifier something that tells you what it is that differentiates real sequences from generated sequences?

How much can you permute a real sequence before it is classified as "fake"? Does it take 10 mutations? 100 mutations? 1 inversion?

Does the classifier report a probability? So can you find sequences that have low/high probability?

How low can you reduce the size of the learning set and still get good results? If you only use sequences from fungus, can it still recognize real human sequences?
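The perturbation question above can be turned into a small experiment. The following is a minimal sketch, not the poster's program: `point_mutate`, `invert_segment` and `mutations_until_fake` are hypothetical helper names, and `toy_is_real` is a placeholder standing in for whatever classifier is actually being tested.

```python
import random
from typing import Callable, Optional, Sequence


def point_mutate(seq: str, n: int, rng: random.Random) -> str:
    """Return a copy of seq with n distinct positions changed to a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases)


def invert_segment(seq: str, start: int, end: int) -> str:
    """Return seq with seq[start:end] reversed (a plain, non-complemented inversion)."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


def mutations_until_fake(
    seq: str,
    is_real: Callable[[str], bool],
    doses: Sequence[int],
    rng: random.Random,
) -> Optional[int]:
    """Return the first dose in doses at which is_real() rejects the mutated sequence."""
    for n in doses:
        if not is_real(point_mutate(seq, n, rng)):
            return n
    return None  # the classifier accepted every mutated copy


def toy_is_real(seq: str) -> bool:
    """Placeholder classifier (an assumption for the demo): 'real' means more than 90% A."""
    return seq.count("A") * 10 > len(seq) * 9


rng = random.Random(42)
toy_seq = "A" * 1000  # placeholder input, not a real genomic sequence
print(mutations_until_fake(toy_seq, toy_is_real, [10, 50, 100, 200], rng))
```

Swapping `toy_is_real` for the real classifier and repeating each dose over many random seeds would give a rough picture of how many mutations it tolerates.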

ADD COMMENT • link 12.1 years ago by brentp 24k

0

Your comments were very interesting.

1. Yes, it is supervised learning. The calculations were based on the overall entropy of the given sequences. They had nothing to do with biological features like GC content, codon usage, etc.
2. The permutation value is something I will try out and let you know.
3. The idea of adding a probability score is the best suggestion I have received so far. I will do that right away.
4. Both fungal and human sequences are real biological sequences. My method as of now cannot distinguish between them. But this observation of yours has given me a lot of food for thought.
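The thread does not say how the entropy was computed. As an illustration only, one common entropy-style feature is the Shannon entropy of the overlapping k-mer distribution; the function name `kmer_entropy` and the choice `k = 3` below are assumptions, not details of the actual method.

```python
import math
import random
from collections import Counter


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A uniformly random string: its k-mer entropy approaches the maximum of 2*k bits.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(5000))

# A perfect tandem repeat, used only as an extreme low-entropy contrast.
repeat_seq = "ACG" * 1667

print(round(kmer_entropy(random_seq), 3))
print(round(kmer_entropy(repeat_seq), 3))
```

Real genomic sequence would be expected to fall somewhere between these two extremes, and short inputs give noisy k-mer counts, which may be one reason a minimum length is needed.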

ADD REPLY • link 12.1 years ago by Zakir Naik ▴ 30

score 2 · Answer 2 · 2012-03-27

2

12.1 years ago

Philippe ★ 1.9k

Hi,

so far I don't see any practical application. That does not mean there are none, of course. We generally have to deal with biological contamination more than with "digital" contamination.

The only thing I can think of is checking whether a sequencing machine generated any artefactual/random sequence while writing its output. However, the interest is still limited because:

1. this is highly unlikely to happen
2. for high-throughput experiments you can tolerate that some reads won't be used (i.e. they won't map to the reference genome, or they are unlikely to strongly support a non-existing contig in de novo assembly)
3. so far, no technology can sequence 5000 nucleotides in a row, so your method won't be usable. Oxford Nanopore Technologies claim they can, but so far no data has been publicly released
4. the generation of a non-existing sequence by a machine does not mean it has been randomly generated; it can be the result of the concatenation/shuffling of actual reads.

The algorithm can still be useful as an example of machine learning (even though I don't know which method you used), but it does not seem, in my opinion, to have any immediate practical application.

ADD COMMENT • link 12.1 years ago by Philippe ★ 1.9k

0

Regarding your point 4, I don't think he is checking whether the sequence exists (he doesn't do any alignment). I suppose he looks at GC content/placement, codon bias, etc., which you don't expect in random sequences. I have no idea when anyone would have a totally random sequence and not know that it isn't DNA, though.

ADD REPLY • link 12.1 years ago by Niek De Klein ★ 2.6k

0

I did not make this assumption. I wanted to say that if you want to detect a machine's erroneous output, his method will only be able to detect whether the machine wrote something totally random, and not whether the resulting erroneous sequence is actually a sort of mix of different other (biological) sequences. Sorry if that was not clear.
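To make the distinction concrete, here is a minimal sketch of the two kinds of non-random artefact mentioned in this answer. The helper names `shuffled` and `chimera` and the two short reads are made up for illustration; they are not taken from any sequencing pipeline.

```python
import random
from collections import Counter


def shuffled(seq: str, rng: random.Random) -> str:
    """Random permutation of seq: identical base counts, local order destroyed."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def chimera(a: str, b: str, rng: random.Random) -> str:
    """Join a non-empty prefix of a to a non-empty suffix of b at random breakpoints."""
    cut_a = rng.randrange(1, len(a))
    cut_b = rng.randrange(1, len(b))
    return a[:cut_a] + b[cut_b:]


rng = random.Random(7)
read_a = "ACGTTGCAACGTTGCA"  # placeholder reads, not real data
read_b = "GGGCCCATATATGGCC"

perm = shuffled(read_a, rng)
mix = chimera(read_a, read_b, rng)

# Shuffling keeps the base composition exactly, so composition alone cannot flag it.
print(Counter(perm) == Counter(read_a))
# A chimera is made entirely of genuine fragments on either side of one breakpoint.
print(mix)
```

A chimera keeps the local structure of both source reads almost everywhere, so a classifier trained only on real vs uniformly random strings would have little reason to reject it.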

ADD REPLY • link 12.1 years ago by Philippe ★ 1.9k

0

Thanks to all for the comments/perspectives. One more naive question: assuming I tweak the method to work for, say, 2000 bp strings, does it have any value?

ADD REPLY • link 12.1 years ago by Zakir Naik ▴ 30