Hi,
We are having some trouble with our ChIP-seq experiment. We're sequencing yeast in paired-end mode (2x76 bp), and we have eight samples with four biological replicates. The samples were barcoded and multiplexed over the four lanes of a flow cell.
The quality of the data is not as good as we would like, but what puzzles me most is that the reverse reads (R2) show much lower quality than the forward reads (R1). This is consistent across all eight samples.
I don't think this is a lane-specific problem. As you can see in the attached pictures below, the forward reads (R1) look better than the reverse reads (R2) regardless of lane or sample.
As I can't think of any biological explanation for the problem (is there one?), I would appreciate it if anyone who has already encountered this kind of problem with their data could share their experience. Is there a technical problem here?
thanks
Assa
What sequencer was this done on (I am guessing a NextSeq) and what was the cluster density?
Have you tried a scan/trim program to rule out the possibility that you have short inserts and are just sequencing adapters on the R2 end?
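Before running a full trimming pass, a quick count of adapter-containing reads can show whether this is worth pursuing. The following is only a minimal sketch: it assumes TruSeq-style adapters (the 13-mer below is the common Illumina adapter prefix; swap in your kit's sequence if it differs), it only looks for exact matches, and the file names in the usage note are hypothetical.

```python
import gzip
from itertools import islice

# Assumption: TruSeq-style adapters. Replace with your kit's adapter if different.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(lines, adapter=ADAPTER_PREFIX, max_reads=100_000):
    """Fraction of the first max_reads reads that contain the adapter exactly."""
    total = 0
    hits = 0
    for seq in islice(fastq_sequences(lines), max_reads):
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


def adapter_fraction_in_file(path, **kwargs):
    """Same as adapter_fraction, reading a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return adapter_fraction(handle, **kwargs)
```

For example, comparing `adapter_fraction_in_file("sample_R1.fastq.gz")` with `adapter_fraction_in_file("sample_R2.fastq.gz")` (hypothetical file names) gives a rough idea of how much adapter read-through there is in each read. Exact matching will undercount in low-quality reads, so treat the result as a lower bound.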
I haven't tried that yet for this data set. For an earlier data set I ran both sickle and trim_galore to remove low-quality and over-represented reads. The data looked better afterwards, but the total library size dropped dramatically (sometimes by more than 50%).
Where can I see the cluster density? Is it something I can get from the sequencer, or do I need to calculate it on my own?
You can download Sequencing Analysis Viewer (SAV) from Illumina (Windows only) and point it to the folder containing the raw data. You will be able to see a lot more detail from the run, including the cluster density. Since this is a NextSeq run, that should be between 150 and 200 K/mm^2.
If this is a new sequencer then these could just be "teething problems" as your techs get used to the instrument, refine concentration estimation, run procedures etc. For any run that looks less than optimal you should contact Illumina tech support and have them remote in (or if your sequencer is disconnected from the network then you will need to send some files in) to look at the run. This helps eliminate hardware/software/reagent issues.
You should be very careful about removing low-quality and over-represented reads. They may be there for a reason, and unless the submitter wants you to remove them there is no reason for a core facility to even look at them. If the run fails your overall average-quality criteria for a "good run", then that is a different issue.
The reduction in data size suggests that the dataset contains a large amount of primer dimers (or short inserts).
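One way to check for short inserts directly is to test whether R1 and the reverse complement of R2 cover the same fragment. When the insert is no longer than the read length, the first L bases of R1 should match the reverse complement of the first L bases of R2, where L is the insert length. The sketch below assumes that geometry and uses a simple mismatch tolerance; the function names and thresholds are illustrative, not taken from any particular tool.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def short_insert_length(r1, r2, min_len=20, max_mismatch_rate=0.1):
    """Return the inferred insert length if it fits within the reads, else None.

    Tries candidate lengths from longest to shortest and returns the first one
    where R1[:length] matches reverse_complement(R2[:length]) within tolerance.
    """
    n = min(len(r1), len(r2))
    for length in range(n, min_len - 1, -1):
        candidate = reverse_complement(r2[:length])
        mismatches = sum(a != b for a, b in zip(r1[:length], candidate))
        if mismatches <= max_mismatch_rate * length:
            return length
    return None
```

Running this over a few thousand read pairs and counting how many return a length (rather than `None`) gives a rough estimate of the short-insert fraction. Low-complexity sequence can produce false matches, and very low R2 quality can hide true ones, so it is a sanity check rather than a replacement for a proper trimmer or read merger.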
You wrote ChIP-seq but tagged the post 'rna-seq'; this is DNA sequencing, right?