Question

Should we ignore variants in any duplicated gene, or only ignore variants in duplicated segments?

1

6.1 years ago

Tania ▴ 180

Hi all

Suppose we have an interesting variant and have checked the alignments, coverage, strand bias, etc., and everything looks fine. The variant is in a duplicated gene (according to the duplicated-gene databases), but it is not in a duplicated segment. Should we ignore it?

Thanks

Variant calling • 1.2k views

ADD COMMENT • link 6.1 years ago by Tania ▴ 180

2

What I usually did was consider only variants called from uniquely mapped reads (probably corresponding to a "unique segment"), irrespective of the duplication state of the gene. Even paralogs may have slightly more divergent regions to which you can map confidently, and in my opinion you shouldn't discard this data.

ADD REPLY • link 6.1 years ago by Fabio Marroni ★ 3.0k

1

You'll find that the majority of the human genome exhibits some level of sequence similarity to other parts of the genome. The lack of adequate genomic maintenance that allowed this to occur has, ironically, helped us to evolve: genes (or parts of them) were copied and then left to mutate over millions of years, conferring new functionality. The human genome is very messy, though.

Aside from sequence similarity in general, duplicated genes are problematic in NGS, and a majority of protein-coding genes have a related pseudogene. Like Fabio, I do not recommend throwing out data for a gene just because it is duplicated. The 'unique alignment' idea that Fabio mentions is one good way to improve the situation, but some aligners implement 'uniqueness' merely by looking at the MAPQ. Bowtie (v1) can genuinely report only reads that align uniquely, though: pass the --best -m 1 parameters to bowtie.
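To make the MAPQ-based notion of 'uniqueness' concrete, here is a minimal Python sketch that keeps only SAM records at or above a MAPQ cutoff. The function name `filter_by_mapq`, the example reads, and the cutoff of 40 are all illustrative assumptions, not part of any tool; in practice `samtools view -q` does the same job on real BAM files.

```python
def filter_by_mapq(sam_lines, min_mapq=40):
    """Yield SAM header lines plus alignments with MAPQ >= min_mapq.

    Hypothetical helper for illustration only; min_mapq=40 is an
    assumed cutoff, and sensible values depend on the aligner.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # SAM column 5 holds the mapping quality
        if mapq >= min_mapq:
            yield line


# Made-up records: read1 maps confidently, read2 is ambiguous.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
kept = list(filter_by_mapq(example, min_mapq=40))
```

Here `kept` retains the header and read1 only. Note that this discards low-MAPQ reads wherever they fall, which is the point: a divergent stretch of a duplicated gene can still collect confidently mapped reads.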

Aside from unique alignment with Bowtie (v1), I recommend setting a MAPQ threshold above 40 or 50 (Phred-scaled); samtools can be used to throw out or mark reads that fall below a particular MAPQ. The maximum MAPQ differs between aligners, so check what yours reports before picking a cutoff. It is then also worth looking at the values of other fields, such as:

DP4, Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
PV4, P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
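As a rough illustration of how DP4 can be inspected, the sketch below pulls the four counts out of a VCF INFO string and checks whether the alternate allele is supported on both strands. The helper names, the example INFO strings, and the `min_per_strand=2` cutoff are assumptions made for the example, not recommended values.

```python
def parse_dp4(info):
    """Return (ref_fwd, ref_rev, alt_fwd, alt_rev) from a VCF INFO string,
    or None if there is no DP4 entry. Hypothetical helper."""
    for entry in info.split(";"):
        if entry.startswith("DP4="):
            return tuple(int(x) for x in entry[len("DP4="):].split(","))
    return None


def alt_on_both_strands(dp4, min_per_strand=2):
    """True if the alt allele has at least min_per_strand high-quality
    bases on each strand (the cutoff is an assumed, illustrative value)."""
    _ref_fwd, _ref_rev, alt_fwd, alt_rev = dp4
    return alt_fwd >= min_per_strand and alt_rev >= min_per_strand


def alt_fraction(dp4):
    """Fraction of high-quality bases that support the alt allele."""
    total = sum(dp4)
    return (dp4[2] + dp4[3]) / total if total else 0.0


# Made-up INFO strings: one balanced call, one with alt reads on one strand only.
balanced = parse_dp4("DP=40;DP4=10,12,9,8;MQ=58")
one_sided = parse_dp4("DP=30;DP4=14,13,3,0;MQ=60")

balanced_ok = alt_on_both_strands(balanced)
one_sided_ok = alt_on_both_strands(one_sided)
```

The balanced call passes the check and the one-sided call does not. A simple count check like this is cruder than the PV4 strand-bias p-value, so treat it as a first look rather than a replacement.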

Take a look at the vcfutils.pl executable that comes bundled with BCFtools, as it can be used to help apply extra filtering on your variants.

Finally, we have to remember that certain regions of the genome, including coding exons, cannot be faithfully sequenced using 'short' read NGS. This is problematic for clinical testing companies that are interested in genes falling into this category, and it requires an additional, complementary method, such as long-read NGS, Sanger sequencing, or something like MLPA (multiplex ligation-dependent probe amplification).