Question

multiple DNA reads aligned to the same gene

1

Entering edit mode

6.2 years ago

Ana ▴ 200

Hello everyone, I have aligned a bunch of whole genome shotgun sequencing data against Pangenome protein sequences by using DIAMOND protein aligner to find out which genes are present and which are absent in my DNA sequence data . I am sorry if my question is dumb, but I am new in alignment and learning. I have some genes which are 100% matched with mutiple DNA sequences, please see the example below! (this the output of DIAMOND which is Blast tabular file format)

HWI-ST913:300:C5W5DACXX:7:1207:13092:33669  Ha15_00039042   100.0   24  0   0   3   74  183 206 1.5e-07 51.6
HWI-ST913:300:C5W5DACXX:7:1208:1383:39276   Ha15_00039042   100.0   26  0   0   22  99  175 200 1.3e-09 58.5
HWI-ST913:300:C5W5DACXX:7:1208:3015:81325   Ha15_00039042   100.0   29  0   0   2   88  178 206 1.1e-10 62.0

I want to know what does this mean? Is this something that I should expect or something is going wrong? Thanks for any help or providing me with some references to learn about this!

alignment protein whole genome • 1.8k views

ADD COMMENT • link 6.2 years ago by Ana ▴ 200

1

Entering edit mode

It's three reads that were mapped to the same gene, to different positions along the gene (183-206, 175-200 and 178-206). Looks like your alignment is working as it should.

ADD REPLY • link 6.2 years ago by Asaf 10k

0

Entering edit mode

Thanks a lot Asaf, got it now. Another little question, how about if the same read matches to multiple genes? like this: could it be possible?

HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha12_00033467   100.0   33  0   0   100 2   91  123 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha10_00000535   100.0   33  0   0   100 2   116 148 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha7_00045828    100.0   33  0   0   100 2   557 589 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha17_00008591   100.0   33  0   0   100 2   118 150 7.8e-12 65.9
HWI-ST913:300:C5W5DACXX:7:1101:1649:2180    Ha15_00038715   100.0   33  0   0   100 2   173 205 7.8e-12 65.9

ADD REPLY • link 6.2 years ago by Ana ▴ 200

1

Entering edit mode

Reads can certainly multi-map. Think about multi-copy genes (e.g. rDNA repeat) that may be present in multiple locations on the genome.

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Entering edit mode

Thanks @genomax, very helpful. Then my last question is that if the same read matches to multiple genes then can I say all these genes present in my DNA sequence data-set? Because the final aim of doing this alignment is to understand which gene is present and which gene is absent in my DNA sequence data-set? Thanks a lot.

ADD REPLY • link 6.2 years ago by Ana ▴ 200

1

Entering edit mode

No.

With the information you have (i.e. sequence of the read) you can't say for sure which copy of the gene (multiple) it originally came from. It could be from any one of them. There is also a possibility that the read could be aligning there by chance (due to sequence similarity of that piece) and may not have originated from the gene it is aligned to.

Note: While mapping you have multiple choices on how to handle reads the multi-map. Following is an example of options from BBMap suite (all aligners may not have same options).

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 
                        multiple top-scoring mapping locations).
                            best    (use the first best site)
                            toss    (consider unmapped)
                            random  (select one top-scoring site randomly)
                            all     (retain all top-scoring sites)

In RNAseq , multi-mapping reads are generally not counted by default.

ADD REPLY • link 6.2 years ago by GenoMax 141k