Biostar Beta. Not for public use.
How to handle chr Y data in BAM files from exome sequencing of human females
4
Entering edit mode
3.4 years ago
United States

It's always strange to see chr Y positions that pass QC in BAM files from human females. I've noted this in several BAM files generated from exome sequencing (chr Y data when data is from a female).

While some people may just ignore chr Y data when sequencing a female, what are best practices if the sex of the individual is unknown?

Can the sex of the individual be determined from just the BAM itself if the gender is not previously known?

5
Entering edit mode
18 months ago
Walnut Creek, USA

This is a difficult question because of XXY people, XYY people, and the fact that much of the Y chromosome is shared with the X chromosome.

The best way to determine whether someone has two copies of X is to look for heterozygousity in the X chromosome. The best way to determine gender, assuming anyone with at least 1 Y chromosome is male, is to look for coverage of the non-PAR (pseudo-autosomal region) of the Y chromosome. Raw coverage of X vs Y is misleading (unless recalibrated) due to the PAR and possibility of multiple copies of either.

If you assume anyone with 2 X chromosomes is female, which is usually correct, I recommend looking for heterozygousity of SNPs in the X chromosome. Even with XXY males, they will genetically behave more like females in inheritance studies (since a single mutation in an X-specific gene will not cause a complete knockout as it would in XY males, and their daughters could inherit either of 2 X copies, and there are many more X-linked problems than Y-linked problems), so considering them genetically female is probably a better approach anyway, since most software that distinguishes between male and female assumes they are XX and XY. The male phenotype is not very important in heredity studies, just the genotype (by "male phenotype" I don't mean "the phenotype of the male", but "the fact that the person is phenotypically male").

2
Entering edit mode

that is good point about the PAR region! your answer is much more complete than mine. Do you know of any tools that would look at both coverage and SNPs to make an inference on XX/XY/XYY/XXY? Also since we are on this topic, would it be worth worrying about contamination from sample prep (say a male was handling a female sample, could that make it seem XXY?)

2
Entering edit mode

To tell the gender of a sample, we don't need to do het calling. Just compare the average read depth in non-PAR non-repetitive regions on X to that on an autosome as you said. This can be done in ~1 minute with random sampling on a mapped BAM. Of course combining depth with het calling is better, but it is more complex and takes more time.

0
Entering edit mode

Thank you! Is there a resource that provides the coordinates for both the PAR and non-PAR non-repetitive regions on X?

1
Entering edit mode

As long as the contamination is not terrible - say, under 2% - it shouldn't really affect an analysis with reasonable cutoffs. Variant callers should not call a variant with under 1:20 allelic ratio heterozygous (since that has at best a 1:1000000 chance of being correct, ignoring ref-bias), and +-2% will not affect your ability to quantify the coverage ratio of the unique portions of the X and Y chromosome to determine XX/XY/XXY/XYY. That said, in a human test, I would consider 2% contamination from another person to be completely unacceptable; if that's detected it should be redone starting with the blood-draw.

I do not know of any tools for determining sex chromosome numbers for non-XX/XY individuals, but probably. BBMap used to have a mode to declare reads mapping to the PAR to be uniquely mapped to the X, rather than ambiguously mapped to X and Y, inspired by Complete Genomics' "PAR-called-in-X" classification of variations found in the PAR - which is a reasonable classification, since those genes have 2 copies and thus should be treated more like autosomes. But, I don't study humans anymore so that mode went away :)

0
Entering edit mode

Also, on the subject of contamination -

Contamination from handlers is not a big issue unless the library is highly amplified. JGI sequences only non-human organisms, and I wrote the human-contaminant removal pipeline; for the vast majority of libraries, with millions of reads, there are only 10s to 100s that map to human. So, having more than a few ppm of contamination from the handler is a rare event that should be reprocessed (when dealing with human data where the contaminant reads cannot be easily removed).

Cross-contamination from libraries on the same plate, or multiplexed together, is a MUCH bigger issue. When the robotic systems have problems (samples physically sloshing into each other), this can lead to >10% contamination; but even barcode misidentification and barcode reagent contamination can lead to >1000ppm contamination. MiSeq has too small a volume to use multiplexing on human exome-capture, so it's fairly safe, but HiSeq is particularly prone to this kind of issue. Therefore, I highly recommend using dual barcodes (at a minimum) and being conservative (favoring het calls) when dealing with multiplexed data.

0
Entering edit mode

Talking about contaminations, you haven't addressed my concern in comment https://www.biostars.org/p/136189/#136255. If you still use your old pipeline, JGI data might be contaminated with human sequences that have no matches in other primates. This might be just a very minor concern, though, as the human reference genome is incomplete anyway. If you still think you are right, it would be good to give concrete examples of contaminations and talk to GRC. GRC is highly experienced and will take the issue very seriously. That is for the better of JGI, GRC and everyone who use the human reference genome. Thanks.

0
Entering edit mode

Ah, well, I'll get around to it soon. It's just that it will take me a day or so to isolate the exact regions again.

1
Entering edit mode
16 months ago
Len Trigg ♦ 1.3k
New Zealand

There certainly are tools available that correctly handle sex chromosomes. With the RTG suite of tools, your human reference can contain a configuration file that specifies the autosomes, sex chromosomes, PAR regions etc, and this file is consulted during both mapping and variant calling. During processing for a sample, you specify the sex (if known), and it Just Works.

For example, during mapping (rtg map), when processing a sample specified as female, the aligner will not attempt to map reads to the Y chromosome. Similarly it will not map reads to the PAR region on the Y chromosome for males (although most human references already have this region masked out anyway). When performing variant calling (rtg snp / rtg family / rtg population / rtg somatic) on a sample specified as male, the caller automatically uses haploid calling for the X and Y chromosomes (except for the PAR regions, where diploid calling is carried out). When performing joint calling of multiple samples that form a pedigree, either a single family (rtg family), or larger multi-generation pedigree (rtg population), the variant calling utilizes mendelian inheritance in the Bayesian models to further inform the variant calling (including appropriate behaviour on the sex chromosomes). Of course, you can manually override any of this if you want.

For all of the modes, there is a validation tool that executes automatically at the end of a mapping job, and it can be explicitly invoked (rtg chrstats) if you have split your mapping into multiple smaller jobs and want it to use aggregate statistics, which analyses the coverage of the autosomes and sex chromosomes to indicate whether the sample may have been mapped with a sex that does not correspond to what the sample actually is (for example, due to sample mislabelling, incorrect pedigree information, or chromosomal abnormalities such as XXY, trisomy, etc).

Give it a go, RTG Core (which contains all that goodness) is free for non-commercial use.

0
Entering edit mode
16 months ago
Ying W ♦ 3.9k
South San Francisco, CA

the 'right' way of doing things it to have a reference for XX and have another reference for XY (and sometimes you will get XXYs!)

you might be able to get clinical data telling you if the sample is from male or female (and choose reference accordingly). You can predict after mapping but then you might end up remapping it which is kind of annoying.

0
Entering edit mode

Thank you for the response. What method to you recommend for predicting sex after mapping if the sex isn't known beforehand (and therefore a XY reference genome is just used by default)?

0
Entering edit mode

I've looked at coverage across X, Y, and 20 (why 20? I don't know, its smaller so it calculates coverage faster. probably better to calculate coverage across more chr, i remember there was a post on biostars a while back that had this nice ggplot figure that plotted chr coverage vs chr size to look for amplification/deletion).

If X and 20 are about the same and Y is low then its probably female. If X and Y are both about half of coverage across 20 then its probably male. There are probably tools out there that would do the prediction but I don't know of any off the top of my head tho I think i've seen this functionality before when using rtg tools

0
Entering edit mode

Are there separate standard human references for XX and XY that are accessible? Or do you mean just remove the Y chromosome from one of the human reference genomes, such as removing the Y from 38.p3 to produce an XX reference?