I have downloaded the 2 paired end fastq files from ENA- documenting MNase Seq data for nucleosome occupancy across the mouse genome: http://www.ebi.ac.uk/ena/data/view/SRR2034494 I mapped the files with bowtie2:
bowtie2 -x mm9 --very-sensitive -p 8 -1 SRR1.fastq -2 SRR2.fastq -S out.sam
94.83% overall alignment rate
everything seems normal here but the problem is that the bed file that results from sam2bed
sam2bed < out.sam > out.bed
has resulted in a bed file with interval sizes that are typically in the 50-70 bp range... This is impossible as nucleosomes are 147 bp long and hence reads of <147 are not possible.
After receiving some advice I have found that the intervals may be too short due to 'sam2bed'. Does anyone know of a way to account for paired ends in sam files when converting to a bed file?
NOTE: I have also tried this with other independent nucleosome datasets and the same thing happens so it is not a question of the data
Are you sure about this? Unless data is old (5+ years) it is unlikely to be in solexa quality format.
Wait sorry I should have specified: it is the bowtie2 (which doesn't have any solexa quals component) command that I am having problems with rather than bowtie 1
The intervals represent individual reads, not fragments, so if the input was 50-70 base long reads then what you're seeing is entirely expected.
Thanks for the reply, please note that I have updated the question
What is your end goal with all of this? I suspect there's a better way to go about things.
Hi, I contacted the address given on the pubmed page. And I have been assured that this is paired-end data documenting nucleosome occupancy genome wide. They confirmed that the intervals in my bed file are too small to be nucleosomes. Instead one should expect 100-200 bp ranges rather than 50-70 bp. They told me that they originally mapped the data with bowtie 1 rather than 2 and this produced much larger, expected intervals.... I am considering just moving back to bowtie1 but I am still attracted to the superior alignment rates given by bowtie2 and would like to pursue it until I am sure that it can not produce same output as bowtie 1 with greater alignment. My downstream analysis requires for the data to be representative of the nucleosome sizes.
Did you even read my answer, explaining why you get these short intervals? You are converting single end reads instead of paired-end fragments.
This should have been the alarm for you that your pipeline is wrong and not the data.
Hi, I should have replied to your answer I will do so now in a comment on your answer below
chrisclarkson100 : bowtie v.1 does ungapped alignments so keep that in mind when thinking about "superior" bowtie2 alignments which may be gapped.
Right, so you haven't answered my question. What exactly are you trying to do with the data? If you run
bamPEFragmentSize
on the BAM file you'll probably see a more meaningful distribution. But even if you can make a BED file that matches that it doesn't mean that that's a reasonable way or proceeding.My job is to document the position of each nucleosome in the genome- using various datasets from MNAse experiments. My job will then be to calculate the nucleosome repeat length (NRL) in different regions. I have a script that does this by analysing the consecutive distances between the starts of each positioned nucleosome. The script requires the interval distances documenting the start and end of each nucleosome to be of a realistic size i.e. not < 147 in order to give viable statistics.... I have previously used a script to account for the paired end read data: https://github.com/homeveg/nuctools/extend_PE_reads.pl but this did not seem to make a difference to the intervals, the vast majority of which are < 147. Hence I was worried that it might be a step prior to taking PEs into account.
I imagine your job would be much easier if you skipped BED files and instead did
bamCoverage --MNase -bs 1
, which will give you a bigWig file of nucleosome density. I'd just Fourier transform that and get the repeat length that way.