Question

Why does my Hi-C contact map show large regions making little to no contact?

11 days ago

Winter • 0

Hello everyone, bioinformatician in training here.

I'm currently working on Hi-C scaffolding. I have a set of scaffolds produced following the VGP genome assembly pipeline (https://doi.org/10.1038/s41587-023-02100-3), using HiFi data from a single individual and pooled Hi-C data, with no Bionano data. The Hi-C reads have now been mapped with BWA-MEM2, the scaffolds refined with YaHS, and the contact maps generated with PretextMap.

However, the contact maps look like nothing I've seen before. Only about half of each putative chromosome/bin appears to have a high contact frequency, and the other half appears to make no contact at all. Moreover, based on previous publications, we should have about 20 chromosomes, but we're seeing at least 50.
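To put a number on the "half of each bin makes no contact" impression, a minimal sketch like the one below could flag low-contact windows along a single scaffold. It assumes the 0-based mapped positions of Hi-C read ends on that scaffold have already been extracted from the BAM; the function name `low_contact_bins`, the bin size, and the 10% cutoff are all hypothetical choices for illustration, not part of any tool mentioned here.

```python
from statistics import median


def low_contact_bins(contact_positions, scaffold_length, bin_size, min_fraction=0.1):
    """Return indices of bins whose contact count is far below the typical bin.

    contact_positions: 0-based positions of mapped Hi-C read ends on one scaffold
                       (assumed to be extracted beforehand, e.g. after MAPQ filtering).
    min_fraction:      a bin is flagged if its count is below this fraction of the
                       median count of the non-empty bins (hypothetical cutoff).
    """
    n_bins = (scaffold_length + bin_size - 1) // bin_size  # ceiling division
    counts = [0] * n_bins
    for pos in contact_positions:
        if 0 <= pos < scaffold_length:  # ignore out-of-range coordinates
            counts[pos // bin_size] += 1

    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        # No contacts at all: every bin counts as low-contact.
        return list(range(n_bins))

    threshold = min_fraction * median(nonzero)
    return [i for i, c in enumerate(counts) if c < threshold]


# Toy example: contacts only fall in the first half of a 1,000 bp scaffold.
positions = list(range(0, 500, 5))
print(low_contact_bins(positions, scaffold_length=1000, bin_size=100))
# -> [5, 6, 7, 8, 9]
```

Running this at two different MAPQ cutoffs would show whether the empty regions are truly unmapped or only disappear once low-quality alignments are filtered out.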

I've attached images of the full map + a representative bin. Any idea what might be causing this? I feel I must have made an error somewhere upstream in the assembly, but it's possible there are additional steps I need to take to refine the Hi-C data. Any suggestions?

So far, I've tried correcting all the potential mistakes I could think of, but none of these changes made any noticeable difference.

First, I'm working with two lanes of Hi-C reads (2 forward, 2 reverse). Initially, I merged the raw FASTQ files before mapping, but I've since mapped each pair of forward and reverse reads independently with BWA-MEM2 and then merged the resulting BAM files. I've also tried using only a single lane of forward and reverse reads.
Initially, I deviated from the VGP pipeline and decontaminated the contigs before scaffolding, but I've also tried scaffolding without decontamination.
I also considered whether the Hi-C reads have some sort of adapter contamination, but the company says the adapters were already removed, and FastQC seems to support this.
I have also tried increasing the minimum mapping quality threshold in PretextMap (--mapq) to 30.
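For reference, the per-lane mapping, BAM merging, and MAPQ-filtered PretextMap steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not my exact commands: the file names, thread count, the `-5SP` Hi-C mapping flags, and the name-sorting step before merging are placeholders or assumptions, and `build_commands` is a hypothetical helper that just assembles the shell command strings.

```python
import shlex


def build_commands(ref, lanes, merged_bam="merged.bam",
                   pretext_out="map.pretext", mapq=30, threads=8):
    """Build shell command strings for per-lane Hi-C mapping, merging, and PretextMap.

    ref:   path to the scaffold FASTA (assumed already indexed with `bwa-mem2 index`)
    lanes: list of (forward_fastq, reverse_fastq) tuples, one per lane
    All file names and the thread count are illustrative placeholders.
    """
    commands = []
    lane_bams = []
    for i, (r1, r2) in enumerate(lanes, start=1):
        bam = f"lane{i}.bam"
        lane_bams.append(bam)
        # Map each lane separately; -5SP is a commonly used Hi-C setting
        # (assumption), then name-sort so the lanes can be merged with -n.
        commands.append(
            f"bwa-mem2 mem -5SP -t {threads} {shlex.quote(ref)} "
            f"{shlex.quote(r1)} {shlex.quote(r2)} "
            f"| samtools sort -n -o {shlex.quote(bam)} -"
        )

    # Merge the name-sorted per-lane BAMs into a single file.
    commands.append(
        "samtools merge -n " + shlex.quote(merged_bam) + " "
        + " ".join(shlex.quote(b) for b in lane_bams)
    )

    # Stream the merged alignments into PretextMap with a MAPQ cutoff.
    commands.append(
        f"samtools view -h {shlex.quote(merged_bam)} "
        f"| PretextMap -o {shlex.quote(pretext_out)} --mapq {mapq}"
    )
    return commands


if __name__ == "__main__":
    lanes = [("L1_R1.fastq.gz", "L1_R2.fastq.gz"),
             ("L2_R1.fastq.gz", "L2_R2.fastq.gz")]
    for cmd in build_commands("scaffolds.fa", lanes):
        print(cmd)
```

Printing the commands instead of running them makes it easy to compare the single-lane and merged-lane variants side by side before committing compute time.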

For what it's worth, the quality control measures for our contigs look great. The length matches closely with our estimated genome size and the BUSCO score is 99.6% with almost no duplicated genes.

Thanks everyone!

[Image: full contact map]

[Image: single representative bin]

BWA-MEM2 Hi-C PretextMap • 185 views
