Demultiplexing a FASTQ file on the identifier line, allowing mismatches
9.2 years ago
pat2402 • 0

Hi,

I have a multiplexed FASTQ file that contains reads like the following:

@HISEQ:55:H76W4HIWA:1:1101:3414:2138 1:N:0:BC1:BC2:BC3
TTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTG
+
FBFFFFFIIFFIIIIIIIIIIFFIFFFFFFFFFFFFBBBBB<B7
@HISEQ:55:H76W4HIWA:1:1101:6230:2144 1:N:0:BC1:BC2:BC3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
FFFFFFFIIIIIIIIFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I have quasi paired-end sequencing, but the second read contains only two barcodes (BC2 and BC3). I therefore transferred BC2 and BC3 from read 2 to the header of read 1 (together with BC1, which is part of the read 1 sequence). I want to demultiplex this file by the barcodes (e.g. "BC1:BC2") in the identifier line. The barcodes are known, but I need to allow one mismatch each for BC1 and BC2. I tried fastq-grep, but unfortunately it's not possible to allow a mismatch there. Do you have any suggestions?

I would be grateful for any kind of help. Thank you.

PS: I can also change the delimiters between the barcodes.

RNA-Seq sequencing fastq demultiplexing

You can demultiplex FASTQ files while allowing mismatches in the barcodes with the tool TagDust 2, but by design it will not let you control the exact number of mismatches (which is why I am posting this as a comment rather than an answer). You can find a benchmark comparing it with other tools in its publication.


Doesn't having the barcodes in the ID line also mean that the data has already been demultiplexed, and that the barcode information is not actually present in the data? When the Casava pipeline that produced this data is run, you have the choice of specifying the number of mismatches.


The reads are multiplexed. I will edit my post to make my problem a little clearer.


If you give some details about your experiment, it would be easier to tell whether you have demultiplexed data or not. Usually, if it's Illumina data, the Casava pipeline will have been run on it. Confirm with your sequencing facility.

9.2 years ago
Ram 43k

Disclaimer: I know nothing about multiplexing; I'm addressing this purely as a string-manipulation problem.


This might be addressed by framing a regex for fastq-grep that allows for at most one mismatch (the character class at each position still matches the original base, so exact matches are accepted too). I'm assuming you're looking at one possible mismatch in each of the two barcodes.

I'll address dealing with one, and then we can look at combining two of these.

Let's say your barcode is ATCACG and you wish to allow one mismatch. The possible patterns then are:

[ATGC]TCACG
A[ATGC]CACG
AT[ATGC]ACG
ATC[ATGC]CG
ATCA[ATGC]G
ATCAC[ATGC]

and the cumulative expression is:

([ATGC]TCACG|A[ATGC]CACG|AT[ATGC]ACG|ATC[ATGC]CG|ATCA[ATGC]G|ATCAC[ATGC])

And if the two barcodes are separated by a :, you can combine them by joining two such expressions with a [:].
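Writing these alternations out by hand gets error-prone once you have several barcodes, so here is a small Python sketch that builds the same pattern programmatically. The second barcode CGATGT below is hypothetical, purely for illustration:

```python
import re

def mismatch_pattern(barcode, alphabet="ATGC"):
    """Regex matching `barcode` with at most one mismatch.

    Each alternative replaces one position with a character class;
    since the class includes the original base, an exact match is
    accepted as well.
    """
    alternatives = [
        barcode[:i] + "[" + alphabet + "]" + barcode[i + 1:]
        for i in range(len(barcode))
    ]
    return "(" + "|".join(alternatives) + ")"

# Two barcodes separated by ":" as in the header line.
# CGATGT is a made-up second barcode for illustration.
header_pattern = mismatch_pattern("ATCACG") + "[:]" + mismatch_pattern("CGATGT")
```

For ATCACG this produces exactly the cumulative expression above, and `re.search(header_pattern, line)` then matches header lines whose BC1 and BC2 each differ from the known barcodes by at most one base.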

Let me know if any of my assumptions is mistaken.


I think this will work, but it would be interesting to know whether it stays performant at the scale of large FASTQ files. Regular expressions can vary widely in performance; different patterns with identical effects can run at very different speeds.

I think tools like cutadapt and Trimmomatic could also be used, run separately for each adapter. Mothur also has an adapter-splitting and filtering command. And there are some dedicated tools for this (although lately, since Casava performs the job as well, these tools have fallen off the radar).


Thank you very much for your reply. I will try this cumulative expression in combination with fastq-grep. I'm also curious to see what the performance of this combination will be.
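If the regex approach turns out to be too slow, a direct Hamming-distance comparison on the header fields avoids the regex engine entirely. This is only a sketch under the assumptions in the question (the header ends in ...:BC1:BC2:BC3); the barcodes and output file names are hypothetical:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(header, known_pairs, max_mismatch=1):
    """Match a FASTQ header ending in ...:BC1:BC2:BC3 against known
    (BC1, BC2) pairs, allowing `max_mismatch` mismatches per barcode.
    Returns the matching known pair, or None."""
    _, bc1, bc2, _ = header.rsplit(":", 3)
    for k1, k2 in known_pairs:
        if hamming(bc1, k1) <= max_mismatch and hamming(bc2, k2) <= max_mismatch:
            return (k1, k2)
    return None

def demultiplex(fastq_path, known_pairs):
    """Stream 4-line FASTQ records into one output file per barcode pair."""
    handles = {}
    with open(fastq_path) as fq:
        while True:
            record = [fq.readline() for _ in range(4)]
            if not record[0]:          # end of file
                break
            pair = assign_barcode(record[0].rstrip("\n"), known_pairs)
            if pair is not None:       # unmatched reads are dropped here
                out = handles.setdefault(pair, open("%s_%s.fastq" % pair, "w"))
                out.writelines(record)
    for h in handles.values():
        h.close()
```

Since this only compares fixed positions rather than backtracking over alternations, its cost per read is proportional to the number of known barcode pairs times the barcode length.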
