Question: Demultiplexing of the Illumina PE data

I'm looking for a convenient tool to demultiplex my Illumina paired-end (PE) data, specifically to extract the read pairs that have one given sequence in the forward read and another given sequence in the reverse read. Could you advise me, please? For example, we start with two FASTQ files, one with forward reads and one with reverse reads.

Forward read sequences:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNAGAGCGTATATGCCGAGNNNNNNNN
NNNNNNAGTCCGTATATGGGGAGNNNNNNNN

Reverse read sequences:

NNNNNNNNNGAGATGGACTACTCACNNNNNN
NNNNNNNNNGAGATGGATTACTCACNNNNNN
NNNNNNNNNGAGAAGGACTACTCACNNNNNN

So I'd like to extract only this pair for further analysis:

NNNNNNAGTCCGTATATGCCGAGNNNNNNNN
NNNNNNNNNGAGATGGACTACTCACNNNNNN

because the forward read contains the tag AGTCCGTATATGCCGAG and the reverse read contains the tag GAGATGGACTACTCAC. For now I need only 100% matches.

Posted 19 months ago by Denis (70); updated 19 months ago by genomax (68k)

Hi Denis,

It is always useful to provide examples of the input and the desired output to clarify exactly what you are trying to achieve. Are you looking to select a subset of reads containing a certain string? Have you looked at related posts on this forum?

Extract specific reads from FASTQ files based on subsequence

Count and location of strings in fastq file reads

Posted 19 months ago by Sej Modha (4.2k)

Hi Sej,

I've updated my post to address your points. Thanks!

Posted 19 months ago by Denis (70)

You can use the prinseq tool with its -custom_params option, giving it the specific string that you are looking for.

Posted 19 months ago by Sej Modha (4.2k)

Hello Denis,

Thanks for adding an example. However, it doesn't look like your real input, as it is neither FASTA nor FASTQ. Furthermore, what does the task you are trying to solve have to do with demultiplexing?

From your description, I understand that you're trying to remove duplicate sequences. This can be done, for example, with seqkit:

$ zcat input.fa.gz | seqkit rmdup -s -o output.fa.gz

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.


fin swimmer

Posted 19 months ago by finswimmer (11k)

Hi fin swimmer!

Many thanks for your reply and for editing the post. No, I'm not removing duplicates: I'm working with Illumina amplicon data, so I'd like to extract the read pairs that contain the PCR primers and discard all other read pairs.

Posted 19 months ago by Denis (70)

Are these Illumina barcodes or internal barcodes/sequences?

Posted 19 months ago by Devon Ryan (90k)

They are custom internal PCR primers.

Posted 19 months ago by Denis (70)

Are the primers more or less always in the same place? I'm wondering whether you can use something like umi_tools or a variant of our demultiplexing script for RELACS data to handle this.

Posted 19 months ago by Devon Ryan (90k)

Yes, sure. The primers are at the 5' end of forward and reverse reads.

Posted 19 months ago by Denis (70)

Then the options I mentioned should work too (possibly with some tweaks).
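
To illustrate why a fixed primer position helps, here is a minimal Python sketch. It is independent of umi_tools and of the RELACS script, and it only checks whether a primer starts within the first few bases of a read. The function name `primer_offset` and the `max_offset=10` window are assumptions made for illustration; the sequences are the ones from the question.

```python
FWD_PRIMER = "AGTCCGTATATGCCGAG"  # forward tag from the question
REV_PRIMER = "GAGATGGACTACTCAC"   # reverse tag from the question


def primer_offset(seq, primer, max_offset=10):
    """Return the 0-based start of an exact primer match that begins at
    position `max_offset` or earlier, or -1 if there is no such match.

    `max_offset` is a hypothetical tolerance, not a value from the thread.
    """
    # str.find only reports matches lying entirely inside seq[0:end],
    # so this end bound allows start positions 0..max_offset.
    return seq.find(primer, 0, max_offset + len(primer))


fwd_read = "NNNNNNAGTCCGTATATGCCGAGNNNNNNNN"
rev_read = "NNNNNNNNNGAGATGGACTACTCACNNNNNN"

fwd_start = primer_offset(fwd_read, FWD_PRIMER)
rev_start = primer_offset(rev_read, REV_PRIMER)
# A window that is too narrow rejects the same reverse read.
too_strict = primer_offset(rev_read, REV_PRIMER, max_offset=5)

# Everything downstream of the primer, e.g. for later trimming.
after_primer = fwd_read[fwd_start + len(FWD_PRIMER):]
```

Restricting the search to the 5' end avoids accepting reads in which the primer sequence happens to occur somewhere in the middle of the amplicon.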

Posted 19 months ago by Devon Ryan (90k)

Denis: Since you edited this post to bump it to the main page again, I am going to assume that you have not yet found a solution.

I can think of a slightly roundabout way to do this with the filtering option of bbduk.sh (see the BBDuk guide).
Step 1: Keep the R1 reads containing AGTCCGTATATGCCGAG by running bbduk.sh on the R1 file with the literal=AGTCCGTATATGCCGAG outm=file_R1.fq.gz options.
Step 2: Keep the R2 reads containing GAGATGGACTACTCAC by running bbduk.sh on the R2 file with the literal=GAGATGGACTACTCAC outm=file_R2.fq.gz options.
Step 3: Run repair.sh in1=file_R1.fq.gz in2=file_R2.fq.gz out1=final_R1.fq.gz out2=final_R2.fq.gz repair to re-pair the two filtered files, so that the final files contain only the R1/R2 reads whose mates both matched. (Note: you may need plenty of memory, depending on the size of the data. Also, both tags are shorter than BBDuk's default k-mer length of 27, so set k to at most the tag length, e.g. k=16, or the literals will not be found.)
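
For readers who want to see the logic of the three steps without BBTools, the following is a minimal Python sketch, not a replacement for bbduk.sh or repair.sh. It assumes plain 4-line FASTQ records and that both files list the mates in the same order, so filtering and re-pairing collapse into a single pass. The helper names (`read_fastq`, `filter_pairs`, `to_fastq`) and the toy read names are made up for illustration; the sequences come from the question.

```python
import io

FWD_PRIMER = "AGTCCGTATATGCCGAG"  # forward tag from the question
REV_PRIMER = "GAGATGGACTACTCAC"   # reverse tag from the question


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def filter_pairs(r1_handle, r2_handle, fwd=FWD_PRIMER, rev=REV_PRIMER):
    """Yield (r1_record, r2_record) when both mates contain their primer exactly.

    Assumes both inputs list the mates in the same order.
    """
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if fwd in rec1[1] and rev in rec2[1]:
            yield rec1, rec2


def to_fastq(seqs, mate):
    """Build a toy FASTQ string with dummy qualities (illustration only)."""
    return "".join(
        f"@read{i}/{mate}\n{s}\n+\n{'I' * len(s)}\n" for i, s in enumerate(seqs, 1)
    )


forward = [
    "NNNNNNAGTCCGTATATGCCGAGNNNNNNNN",
    "NNNNNNAGAGCGTATATGCCGAGNNNNNNNN",
    "NNNNNNAGTCCGTATATGGGGAGNNNNNNNN",
]
reverse = [
    "NNNNNNNNNGAGATGGACTACTCACNNNNNN",
    "NNNNNNNNNGAGATGGATTACTCACNNNNNN",
    "NNNNNNNNNGAGAAGGACTACTCACNNNNNN",
]

kept = list(filter_pairs(io.StringIO(to_fastq(forward, 1)),
                         io.StringIO(to_fastq(reverse, 2))))
kept_headers = [(r1[0], r2[0]) for r1, r2 in kept]
```

With real data the two handles could come from `gzip.open(path, "rt")`. Because the files are read in lockstep, memory use stays constant, at the cost of requiring that the inputs are still properly paired.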

Posted 19 months ago by genomax (68k)

Hi genomax! I much appreciate your help and the feasible solution.

Posted 19 months ago by Denis (70)

You could use cutadapt (http://cutadapt.readthedocs.io/en/stable/) or sabre (https://github.com/najoshi/sabre).

There are probably more options.

Posted 19 months ago by gb (780)

Hi gb,

Thanks for the reply. It seems sabre doesn't support dual-index Illumina data. Am I right? I still have to check the cutadapt documentation.

Posted 19 months ago by Denis (70)

This is the demultiplexing section: http://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

I am not sure about dual indexes, but both sabre and cutadapt can be used for paired-end reads. What kind of data is it? Amplicon sequencing? In that case I usually merge the reads first with FLASH and demultiplex afterwards. If the tools do not support dual indexes, you could run the process twice: first on the forward index and then on the reverse.
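
The "run it twice" idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not how sabre or cutadapt work internally. The sequences are the variants listed in the question, treated here as if they were three forward and three reverse tags (an assumption); the labels F1..F3 and R1..R3, the read names, and the function names are hypothetical.

```python
from collections import defaultdict

# Sequences from the question; labels are hypothetical.
FWD_TAGS = {"F1": "AGTCCGTATATGCCGAG", "F2": "AGAGCGTATATGCCGAG", "F3": "AGTCCGTATATGGGGAG"}
REV_TAGS = {"R1": "GAGATGGACTACTCAC", "R2": "GAGATGGATTACTCAC", "R3": "GAGAAGGACTACTCAC"}


def assign_tag(seq, tags):
    """Return the label of the first tag found exactly in seq, else None."""
    for label, tag in tags.items():
        if tag in seq:
            return label
    return None


def demultiplex_two_pass(forward_reads, reverse_reads, fwd_tags, rev_tags):
    """Bin read IDs by (forward label, reverse label) using two passes."""
    # Pass 1: label every read by its forward tag.
    fwd_label = {}
    for read_id, seq in forward_reads:
        label = assign_tag(seq, fwd_tags)
        if label is not None:
            fwd_label[read_id] = label

    # Pass 2: label by reverse tag, keeping only reads that survived pass 1.
    bins = defaultdict(list)
    for read_id, seq in reverse_reads:
        if read_id not in fwd_label:
            continue
        label = assign_tag(seq, rev_tags)
        if label is not None:
            bins[(fwd_label[read_id], label)].append(read_id)
    return dict(bins)


forward_reads = [
    ("read1", "NNNNNNAGTCCGTATATGCCGAGNNNNNNNN"),
    ("read2", "NNNNNNAGAGCGTATATGCCGAGNNNNNNNN"),
    ("read3", "NNNNNNAGTCCGTATATGGGGAGNNNNNNNN"),
]
reverse_reads = [
    ("read1", "NNNNNNNNNGAGATGGACTACTCACNNNNNN"),
    ("read2", "NNNNNNNNNGAGATGGATTACTCACNNNNNN"),
    ("read3", "NNNNNNNNNGAGAAGGACTACTCACNNNNNN"),
]

bins = demultiplex_two_pass(forward_reads, reverse_reads, FWD_TAGS, REV_TAGS)
wanted = bins.get(("F1", "R1"), [])
```

Only the (F1, R1) bin corresponds to the pair requested in the question; the other bins show how further tag combinations would be separated by the same two passes.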

Posted 19 months ago by gb (780)

Ah! I see now that it is about PCR primers. I already thought so, because the Illumina indexes have often been trimmed off already. The merging that I mentioned makes things easier, but whether it works depends on the length of the target, so keep that in mind. If your target is 600 bases long, there will be no overlap, or not enough, to merge the reads, so in that case it is not a good idea.
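
The overlap argument is simple arithmetic, sketched below in Python. The read length is not stated in the thread, so the 2 x 300 bp run is an assumption, as is the minimum overlap of 10 bases; check the defaults of whichever merger you use. The function names are made up for illustration.

```python
def expected_overlap(read_len_fwd, read_len_rev, fragment_len):
    """Number of fragment bases covered by both mates.

    A negative value is the size of the unsequenced gap between the mates.
    """
    covered_fwd = min(read_len_fwd, fragment_len)
    covered_rev = min(read_len_rev, fragment_len)
    return covered_fwd + covered_rev - fragment_len


def can_merge(read_len_fwd, read_len_rev, fragment_len, min_overlap=10):
    """True if the mates overlap by at least `min_overlap` bases (assumed threshold)."""
    return expected_overlap(read_len_fwd, read_len_rev, fragment_len) >= min_overlap


READ_LEN = 300  # assumed 2 x 300 bp run; not stated in the thread

overlap_600 = expected_overlap(READ_LEN, READ_LEN, 600)  # mates just touch
overlap_450 = expected_overlap(READ_LEN, READ_LEN, 450)
gap_250 = expected_overlap(250, 250, 600)                # shorter reads leave a gap
```

Under these assumptions a 600-base target gives zero overlap, which matches the remark above that merging is not a good idea for targets of that length.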

Posted 19 months ago by gb (780)
