Hi,
We have sequencing data (targeted panel; ~500 genes) that was aligned with BWA-MEM to the human reference hg19. We are interested in retrieving aligned reads (from the respective BAM files) that are UNIQUE to the Y-chromosome.
The problem: The Y-chromosome contains many repetitive (duplicated) sequences (ampliconic regions), and many coding genes (roughly 78) have an X-linked homologue with 60-90% nucleotide similarity. Since we want to call Y-chromosome losses, we are, broadly speaking, interested in sequencing coverage ratios between the tumor and the normal samples.
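To make "coverage ratio" concrete, here is a minimal sketch of what I mean. The function name `y_log2_ratio` is made up for illustration. Normalizing each sample by its total read count is my own assumption about a reasonable approach, not an established method.

```shell
# Hypothetical helper: log2 of (tumor Y fraction / normal Y fraction).
# Arguments are read counts: tumor_Y tumor_total normal_Y normal_total.
# Prints NA when any count is zero or negative, since the log2 ratio is undefined.
y_log2_ratio() {
  awk -v ty="$1" -v tt="$2" -v ny="$3" -v nt="$4" 'BEGIN {
    if (ty <= 0 || tt <= 0 || ny <= 0 || nt <= 0) { print "NA"; exit }
    # Normalize each Y count by the sample total, then take log2 of the ratio
    printf "%.3f\n", log((ty / tt) / (ny / nt)) / log(2)
  }'
}
```

The counts could come from something like `samtools view -c <bam.in> chrY` for each sample (the contig may be named `Y` instead of `chrY`, depending on the reference build). A clearly negative value would then hint at a Y loss in the tumor.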
Given the nature of the Y-chromosome, there are now a few questions:
- How can I filter aligned reads, so that I can guarantee that those are unique to the Y-chromosome?
- Does it make sense to keep only reads with MAPQ > 30 (to ensure that those reads unambiguously come from the Y-chromosome)?
- e.g. (-q 31 keeps MAPQ > 30; -F 256 drops secondary alignments)
samtools view -q 31 -F 256 <bam.in>
- There are some optional tags (e.g. XA for alternative hits) described, but I am not sure whether they are still supported (the most recent documentation I found is about 10 years old).
- Moreover, given that the X- and Y-chromosome are highly similar, how can I make sure that those reads do not map equally well to the X-chromosome?
- Is there a way to filter out reads that map to more than one location on the Y-chromosome?
- e.g.
I know there is a lot of information on MAPQ values, etc., but also a lot of ambiguity.
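For reference, here is a minimal sketch of the kind of filter I have in mind, combining the ideas from the questions above. The function name `filter_unique_y` and the default contig name `chrY` are my own choices. The sketch assumes that BWA-MEM still writes XA (alternative hits) and SA (supplementary/chimeric) tags, which is exactly what I am unsure about.

```shell
# Hypothetical helper: reads SAM text on stdin, writes the kept records to stdout.
# Arguments (both are assumptions): $1 = Y contig name, $2 = minimum MAPQ (31, i.e. MAPQ > 30).
filter_unique_y() {
  awk -v chrom="${1:-chrY}" -v minq="${2:-31}" '
    BEGIN { FS = "\t" }
    /^@/                    { print; next }   # pass header lines through
    $3 != chrom             { next }          # not aligned to the Y contig
    $5 + 0 < minq + 0       { next }          # MAPQ below the threshold
    int($2 / 256) % 2 == 1  { next }          # secondary alignment (FLAG 0x100)
    int($2 / 2048) % 2 == 1 { next }          # supplementary alignment (FLAG 0x800)
    {
      # Drop reads that carry an XA (alternative hits) or SA (chimeric) tag
      for (i = 12; i <= NF; i++)
        if ($i ~ /^(XA|SA):Z:/) next
      print
    }
  '
}
```

It would be used roughly as `samtools view -h <bam.in> chrY | filter_unique_y chrY 31 | samtools view -b -o <bam.out> -` (this needs an indexed BAM). As far as I understand, XA is only written when the number of alternative hits is small, so the MAPQ threshold would still have to do most of the work.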
Many many thanks for your help and advice