I'm simulating variation in a human genome, but I want to restrict simulated structural variants to places where they're known to occur. To do that, I need to assess which parts of the human genome (introns, exons, and intergenic regions) are more or less structurally conserved in an average human.
Essentially, I'd like data that I can use to map most loci to a conservation level or degree; a real-valued, categorical, or even binary (conserved vs. not conserved) measure is sufficient. If I have a sufficiently representative list of structural variant sites (e.g. in a VCF or GFF file), I should be able to restrict my simulator to those sites (or to sites like them, using a mixture model).
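As a rough sketch of the kind of mapping I have in mind (not a finished method): given VCF-style records, count structural variants per fixed-size window and call a window "conserved" if it has fewer than some number of observed variants. The function names, the 10 kb window, the threshold of 1, and the two example records are all made up for illustration. The only thing taken from the VCF format itself is that structural variant records can carry an `END=` key in the INFO column.

```python
from collections import defaultdict


def parse_sv_sites(vcf_lines):
    """Yield (chrom, start, end) for each record in VCF-formatted lines.

    Assumes tab-separated VCF records; uses INFO/END when present,
    otherwise treats the variant as a single-position site.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, info = fields[0], int(fields[1]), fields[7]
        end = pos
        for entry in info.split(";"):
            if entry.startswith("END="):
                end = int(entry[4:])
        yield chrom, pos, max(end, pos)


def sv_counts_per_window(sites, window=10_000):
    """Count how many variants overlap each (chrom, window index) bin."""
    counts = defaultdict(int)
    for chrom, start, end in sites:
        # VCF positions are 1-based, so shift by one before binning.
        for idx in range((start - 1) // window, (end - 1) // window + 1):
            counts[(chrom, idx)] += 1
    return counts


def is_conserved(counts, chrom, pos, window=10_000, threshold=1):
    """Binary label: True if fewer than `threshold` variants hit the window."""
    return counts.get((chrom, (pos - 1) // window), 0) < threshold


# Hypothetical records, only to show the expected input shape.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t15000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=26000",
    "1\t52000\t.\tN\t<INS:ME:ALU>\t.\tPASS\tSVTYPE=INS",
]
counts = sv_counts_per_window(parse_sv_sites(example))
print(is_conserved(counts, "1", 5000))   # window with no observed variants
print(is_conserved(counts, "1", 20000))  # window overlapped by the deletion
```

The obvious weakness is that a window with no observed variants is only "conserved" to the extent that the input list is representative, which is exactly why I need an unbiased source.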
dbVar and DGV seem like good sources for this data, but they're both heavily biased toward disease-associated variants, so they're unlikely to be representative of variation in an average person. The low-coverage data sets from the 1000 Genomes pilot study have some data (e.g. MobileElementInsertions.sites.vcf), but even with all populations and all variant types mashed together, they contain only tens of thousands of variants.
Does anyone have insight on this? Is there a large, unbiased source for this kind of data? If not, what's my best option?