Issue With Alignment Of Short DNA Sequences
12.1 years ago
Arpssss ▴ 30

Can anybody tell me whether algorithms for aligning short DNA sequences to the human genome treat the repeats in the genome as junk? More concretely: do tools such as Bowtie and BLAT take repeats in the genome into account when building their index? Thanks.

blast dna sequencing genome bowtie • 2.2k views
12.1 years ago

First of all, the concept of junk DNA is both outdated and misleading because of the secondary meaning of the word. So don't use it: there is no such thing as junk DNA, only DNA with as yet unknown function.

As for your question, a mapper's job is to align a read to a genome. Forget about index building; it is just an intermediate step that is not relevant to that function. When a read maps to multiple locations, the mapper may report all the mappings (usually subject to cutoffs) or report one location with a flag indicating that the read maps to other locations as well. This behavior is mapper-specific.
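To make the two reporting styles concrete, here is a minimal Python sketch. It is a toy exact-match search, not how any real mapper works internally; the names `find_all_hits`, `report`, `report_all`, and `max_hits` are made up for illustration.

```python
def find_all_hits(genome: str, read: str) -> list[int]:
    """Return every 0-based start where the read matches the genome exactly."""
    genome, read = genome.upper(), read.upper()
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        # Step forward by one so overlapping matches are found too.
        start = genome.find(read, start + 1)
    return hits


def report(genome: str, read: str, report_all: bool = True, max_hits: int = 10) -> dict:
    """Report all hits up to a cutoff, or only the first hit plus a flag."""
    hits = find_all_hits(genome, read)
    multi_mapped = len(hits) > 1
    if report_all:
        return {"positions": hits[:max_hits], "multi_mapped": multi_mapped}
    return {"positions": hits[:1], "multi_mapped": multi_mapped}


# Toy genome in which the 8-mer ACGTTTGA occurs twice.
genome = "ACGTTTGACCACGTTTGA"
all_hits = report(genome, "ACGTTTGA")
one_hit = report(genome, "ACGTTTGA", report_all=False)
unique = report(genome, "GACC")
```

In both modes the repeat is still searched; the only difference is how the ambiguity is reported, which is why it can be handled after alignment.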


@Istvan Albert, thanks. However, do mappers (mostly) take the repeats in the genome into consideration? For example, in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ repeats are shown in lower case (a, c, g, t) and non-repeating sequence is shown in upper case. Does the mapper use those a, c, g, t bases, or does it only consider A, C, G, T? Thanks.


I think in most cases one is better off aligning the reads first and then dealing with the reads that map to multiple locations, rather than ignoring parts of the genome.

12.1 years ago
Vikas Bansal ★ 2.4k

Yes, Istvan is right. It depends on the mapper and the parameters you are using. If you do not want to map your reads to repeats, just mask them with N's (or download the pre-masked genomes available at RepeatMasker <http://www.repeatmasker.org>), as most mapping tools will skip N's.
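If you start from a soft-masked FASTA file (repeats in lower case, as in the UCSC files mentioned above), a rough Python sketch of turning it into a hard-masked one could look like this. The function names are made up for illustration, and it assumes a plain FASTA layout with `>` header lines.

```python
from typing import Iterable, Iterator


def hard_mask(sequence: str) -> str:
    """Replace every lowercase (soft-masked) base with N; keep uppercase as is."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with sequence lines hard-masked and headers untouched."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header line: leave unchanged
        else:
            yield hard_mask(line)


# Small in-memory example instead of a real chromosome file.
soft_masked = [">chr_toy\n", "ACGTacgtacGTAC\n", "ttttGGCC\n"]
masked = list(hard_mask_fasta(soft_masked))
```

For a real chromosome you would pass an open file handle instead of the list and write the yielded lines to a new file before indexing.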

@Istvan: Maybe I am wrong, but although you said "forget about index building", I think it is sometimes important. In this case, as Arpssss said, repeats are shown in lower case, and Novoalign says that the "indexing process can optionally ignore lowercase nucleotides in the reference genome". So if someone wants to ignore lower case, they should do the indexing carefully.

I read about Novoalign here, and it says:

Question: How does novoalign handle lowercase masking?

Answer: The indexing process can optionally ignore lowercase nucleotides in the reference genome/nucleotide database. This means the lower case letters cannot be used to initiate an alignment however the Needleman-Wunsch alignment process can extend into lower case sequence as long as an alignment was initiated in indexed upper case sequence.
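The idea in that answer can be illustrated with a toy seed-and-extend sketch in Python. This is not Novoalign's implementation: the k-mer index, the function names, and the `skip_lowercase` parameter are assumptions made for illustration, and a plain exact-match comparison stands in for the Needleman-Wunsch extension.

```python
from collections import defaultdict


def build_index(genome: str, k: int, skip_lowercase: bool = True) -> dict[str, list[int]]:
    """Map each k-mer to its start positions, optionally skipping soft-masked k-mers."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        # A k-mer containing any lowercase base is not allowed to seed.
        if skip_lowercase and not kmer.isupper():
            continue
        index[kmer.upper()].append(i)
    return index


def map_read(genome: str, index: dict[str, list[int]], read: str, k: int) -> list[int]:
    """Seed with indexed k-mers, then extend by case-insensitive exact match."""
    read = read.upper()
    ref = genome.upper()  # extension ignores case, so it may run into repeats
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start >= 0 and ref[start:start + len(read)] == read:
                hits.add(start)
    return sorted(hits)


# Toy genome: unique uppercase flanks around a lowercase (soft-masked) repeat.
genome = "GATTACAggccggccTTGCA"
masked_index = build_index(genome, 4)
full_index = build_index(genome, 4, skip_lowercase=False)

spanning = map_read(genome, masked_index, "TACAGGCC", 4)      # seeds in TACA, extends into repeat
inside_masked = map_read(genome, masked_index, "GGCCGGCC", 4)  # no uppercase seed available
inside_full = map_read(genome, full_index, "GGCCGGCC", 4)      # seeds allowed everywhere
```

With the lowercase-skipping index, a read that lies entirely inside the repeat finds no seed and is not placed, while a read that starts in unique sequence can still extend into the repeat.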


What I wanted to say is that index building is a tool-specific step that may or may not be required. It is similar to setting a certain parameter or cutoff: there is no generic advice that we could give on how to index in general.
