Illumina Overlapping R1/R2 reads, error-correction in python
0
0
Entering edit mode
4.9 years ago
Mick ▴ 30

Hi guys, I'm new to this community.

I have two fastq files per sample from our Illumina platform with the forward and reverse reads. I cut the reads in both files so that only the segment in the middle where the two reads overlap is left ( around 100bp). I would like to use forward and reverse reads for error-correction. I want to look at each position and only keep the bases where forward and reverse read show the same base.

First I aligned both files seperately using bwa.

Now I would like to open both aligned bam files in pysam and compare forward and reverse read with each other. How can I achieve this? I think I would probably have to sort the bam files by read name and then go through both files in parallel. But this seems clumsy and error-prone to me. Is there a way in pysam to get corresponding forward and reverse reads from two bam files?

Thank you very much in advance.

alignment sequence SNP • 2.5k views
ADD COMMENT
1
Entering edit mode

You make me wonder why you want to do that. I'm not sure for which application this approach would be recommended. Next question: you only want to keep bases for which both reads agree, so you want to completely remove other bases (and as such split reads?) or mask them with an 'N'? Note that all of this involves more than just changing the nucleotides, you for example also have to change the CIGAR, NM and MD tags. Third question: why do you expect this to make such a big difference? Illumina reads are in general pretty accurate, and variant callers are smart enough.

Fourth, why don't you align both reads together; to the same bam file? That would conceptually be a lot easier.

ADD REPLY
1
Entering edit mode

If your only aim is to error correct then there are tools for that. Look into bbmerge.sh from BBMap suite You can also use tadpole.sh or bbnorm.sh: Correction of errors or mismatches in the read file

ADD REPLY
0
Entering edit mode

Hey guys, thank you for your input. Ok so my sequencing data may be a bit unique. I have around 1 million reads per sample. All from the same 100bp genomic region. I'm looking for a rare SNP mutation at one specific base of that region (say 0.1% of the reads have the mutation at that position). The idea is to get a .wig file from the data with the count of each base A, C, G and T at each position. Then use the other 99 positions to get a baseline for how big the average error rate is and what the standard deviation is. Then use this information to look the target position and see if the there are more mutated bases than the error rate + 3 SD. If yes call it positive.

For this to work down to more rare mutations I want to error correct by comparing forward and reverse read. I think it would be best to just not count mismatched bases in the wig file, rather than make a call based on quality or something. Replacing it with N could work aswell.

Just to be clear, we also have UMIs in this approach, but I'm just learning how to use NGS Data and I thought I'd start out with this and then work with the UMIs once I have figured this first step out.

Thank you genomax for the suggested tools, I'll take a look at them and see if I can use them for this purpose.

ADD REPLY
0
Entering edit mode

You should work with UMI's first (they are separating the samples or regions?) and then deal with the overlaps.

You have crazy deep coverage if you are only looking at a 100 bp regions. You may have to check into this but super deep coverage can cause its own issues. Someone with more experience in this area will comment.

ADD REPLY
0
Entering edit mode

Hi genomax, we have barcodes on the outside of the read which we use to seperate the samples. I have already dealt with that with the tool cutadapt. It worked well, I didn't have any problems there. So the samples are already seperated into their own fastq files for F and R read.

The UMIs are random tags of length 12bp that we add next to the genomic regions before amplification by PCR. These can be used later for error correction aswell, because all reads with the same UMI should have the same sequence because they originated from the same original DNA strand. This is why we need such deep coverage because multiple reads are needed for each UMI.

I thought I'd start out with the more simple forward and reverse read correction and once that works I'll try to implement the UMI correction.

ADD REPLY
0
Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original questions.

Illumina has a pretty low error rate of sequencing but at such great depth it will start to be evident in your data. I would not start with correcting UMI's but just use the ones that match for start. There may be enough reads that fit the bill.

BTW: Was a proof of concept/small dataset ever investigated for this approach and shown to work?

ADD REPLY
0
Entering edit mode

This approach has been in use in our lab for a while now. I'm a med student and usually my job is the library preparation in the lab. I'm trying to figure out how the data analysis part works because I would like to be able to take a look at the data myself and draw my own conclusions, but our bioinformatician is very busy and doesn't have the time to teach me.

So far I'm digging the idea to join forward and reverse read bam files into one bam file. Then I should be able to sort the file by name and then forward and reverse read should always be below one another in the file. I'll work on that and let you know how it goes.

ADD REPLY
0
Entering edit mode

I use bwa for the alignment. Can I align both files for forward and reverse read with the option "single-end" and then join the bam files or should I use "paired-end". Is there any advantage using paired-end for my purposes?

ADD REPLY
0
Entering edit mode

NOTE: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. SUBMIT ANSWER is for new answers to original questions.

If you have paired-end data you should align it as paired-end. I am still not clear what path you are trying to take here. Recreate what your bioinformatician does or try stuff out on your own.

ADD REPLY
0
Entering edit mode

The way I see it you are over-engineering a problem that has already been solved - and with that, you are making the process less precise and accurate.

{edited} - not relevant advice as I did not fully understand the OPs problem layout. SNP callers are not quite ready to call variation on data of the type the OP has.

ADD REPLY
0
Entering edit mode

I set out to try and recreate what our bioinformatician does, but since I do not have a step by step instruction I need to get creative here and there.

Well its not really paired end data in the usual sense right?, because its actually completely the same genomic region just in reverse. There is not "gap" in between and the overlap is 100%. So I'm wondering if there is any reason to put up with the paired-end alignment or if single-end will do just fine.

ADD REPLY
0
Entering edit mode

Ok, now I understand better what you are doing (after reading further down the line).

This is really not about error correcting (as the title puts it) but it is about how to analyze very a short sequence that has been measured millions of times with the goal to identify rare mutations - where the rarity of the SNP may be indistinguishable from errors.

There seems to be some similarity to RAD-Seq methods like STACKS http://catchenlab.life.illinois.edu/stacks/ see if that would help.

http://catchenlab.life.illinois.edu/stacks/

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6