We have NGS whole-exome data for a case-control study with 2000 cases and 2000 controls. Sequencing and QC were done by a third party, so in theory we already have "clean" data.
We are planning to run gene-based tests (SKAT, burden, etc., or any other analysis). Would you advise additionally removing any (or all?) of the repeat regions below (the files are available from UCSC goldenPath):
- simpleRepeat.txt.gz
- rmsk.txt.gz
- genomicSuperDups.txt.gz
We have settled on using simpleRepeats, but have no real rationale for that choice. Is it common practice to remove repeat regions, and if so, which ones?
It was also suggested that we use "blacklists". Which ones should we use?
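
To make the question concrete, this is roughly the kind of filter we have in mind, as a minimal Python sketch rather than our actual pipeline. It assumes the repeat regions have already been read from the UCSC tables into `(chrom, start, end)` tuples with 0-based, half-open coordinates, and that variant positions are 1-based as in a VCF. The function names `build_index` and `in_repeat` are hypothetical, made up for this illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(intervals):
    """Merge overlapping or adjacent intervals per chromosome.

    intervals: iterable of (chrom, start, end), 0-based half-open
    (assumed to match the UCSC table coordinates).
    Returns {chrom: (sorted_starts, matching_ends)}.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = []
        for start, end in ivs:
            if merged and start <= merged[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_repeat(index, chrom, pos):
    """Return True if a 1-based variant position falls in a repeat region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    p0 = pos - 1  # convert the 1-based VCF position to 0-based
    i = bisect_right(starts, p0) - 1  # last interval starting at or before p0
    return i >= 0 and p0 < ends[i]


# Toy regions and variants, purely illustrative (not real annotations).
regions = [("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 10, 20)]
idx = build_index(regions)
variants = [("chr1", 100), ("chr1", 101), ("chr1", 250), ("chr1", 251), ("chr3", 5)]
kept = [v for v in variants if not in_repeat(idx, *v)]
```

Variants in `kept` would then go into the gene-based tests. The open question is which of the region sets above, if any, should feed into `regions`.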