Biostar Beta. Not for public use.
Mpileup generation using Bisulfite converted reference
Entering edit mode
15 months ago
kspata • 50

I have a bisulfite converted reference genome (used bismark_genome_preparation on the reference file). Can I use this bisulfite converted reference genome to create index file using samtools faidx? I want to create a samtools mpileup file for downstream analysis.

The bismark_genome_preparation creates following folders and files:

Bisulfite-Genome: 1. CT_conversion -BS_CT.1.bt2 -BS_CT.3.bt2 -BS_CT.rev.1.bt2 -BS_CT.2.bt2 - BS_CT.rev.2.bt2 - BS_CT.4.bt2 - genome_mfa.CT_conversion.fa 2. GA_conversion -BS_GA.1.bt2 -BS_GA.3.bt2 -BS_GA.rev.1.bt2 -BS_GA.2.bt2 - BS_GA.rev.2.bt2 - BS_GA.4.bt2 - genome_mfa.GA_conversion.fa

Can I use the files genome_mfa.CT_conversion.fa and genome_mfa.GA_conversion.fa to create index files using samtools faidx command?

My other question is

How to generate mpileup files from these faidx reference files? Because it will generate two mpileup files for each reference sequence? Can I then merge two mpileup files for downstream analysis?

 samtools mpileup -d <integer> -f <ref1>  <input.bam> >  output_1.mpileup
 samtools mpileup -d <integer> -f <ref2>  <input.bam> >  output_2.mpileup

Can I then merge the two mpileup output files?

Is there any other approach?

Thanks In advance.

Entering edit mode

That would be a yes on the first question. As for the second question, that would be a bit more dedicated. I assume, you'd need to check each position from both mpileups and convert the information so that the variant information conforms to the original reference (for mpileup the CT and AG converted references are two different references, so the information provided in the files is not compatible.)

What I am wondering, though - why don't you use bismark and its downstream analysis? The whole genome methylation report will then give you exactly what you seemingly want to have (at least if I infer it correctly from your question.)

Edit: bismark-comment

Entering edit mode

Thank you for this insight. I really appreciate it. I did use bismark pipeline for downstream analysis instead of the mpileup generation.

I have another question where I am studying the methylation status of a gene DMPK. For this purpose is it better to map samples across whole genome reference sequence (hg38) or only DMPK gene region?

I ran my samples with both and observed that coverage after mapping to entire human genome is very less (<7X) and when I map same samples with DMPK gene sequence the coverage increases to upto 100X. The samples were sequenced on MiSeq and are bisulfite converted amplicons.

Which one of the two approaches is correct?

Entering edit mode

I think the way to go here is to map against the whole genome. The aligner tries to map as many reads as possible, so if there is some similarity between your gene region and other regions in the genome, it might just be good enough to (mis-)align reads from those other regions to your target region. If you map to the whole genome, those reads will have better alignment scores against their original regions and will therefore not align to your target, which would explain your low coverage.


Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1