Question

How to split a fastq file into each corresponding sample.fastq?

0

Entering edit mode

6.6 years ago

tiago211287 ★ 1.4k

I have a illumina hiseq2500 run with Forward and Reverse files. Each file contains many samples with two indexes located at the read label, as in:

@GHAY-HISEQ2:5:2308:17910:42054#CCTAGAAT-AGGTAAGG/2;1

I'm looking for a tool that could help me split each fastq sample inside this BIG fastq based on the 2 indexes. Anyone have ever faced this problem?

Thanks.

demultiplexing fastq illumina • 9.2k views

ADD COMMENT • link updated 6.6 years ago by Dan D 7.4k • written 6.6 years ago by tiago211287 ★ 1.4k

0

Entering edit mode

Could you paste example read record for each index?

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

1

Entering edit mode

Input fastq:

$ cat test.fq 
@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

barcodes.txt:

$ cat barcodes.txt 
TTGCTGGA-ACCAACTG
GTTCATTA-AGGTAAGG
GAGAGTTG-AGGTAAGG
CATCAGAC-AGGTAAGG

code:

$ for i in $(cat barcodes.txt);do seqkit grep -r -p $i test.fq > $i.fq ;done

Seqkit is available here

There will be 4 output files: one for each bar code in barcodes.txt (with extension each barcode.fq-- eg. GAGAGTTG-AGGTAAGG.fq ). Reads with matching barcodes will be present in each barcode.fq.

Since there is only one match with barcodes.txt, only TTGCTGGA-ACCAACTG.fq will have reads inside it, rest fastq will be empty.

$ cat TTGCTGGA-ACCAACTG.fq 
@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

However, seqkit output is truncating text after +, for each read and IMHO, this is not an issue as the text is duplicated in header line (starting with @).

ADD REPLY • link 6.6 years ago by cpad0112 21k

0

Entering edit mode

6.6 years ago

Dan D 7.4k

The FASTX toolkit is way out of date, but I think FASTX Barcode Splitter may be what you need. As long as it works with your specific FASTQ format you're all set.

ADD COMMENT • link 6.6 years ago by Dan D 7.4k

0

Entering edit mode

I was reading the manual of this tool but it seemed to me that it only look for barcodes at the read sequence, not in the labels. Or am I wrong?

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

score 4 · Accepted Answer · 2017-09-03

4

Entering edit mode

6.6 years ago

GenoMax 141k

This is a pretty non-standard way of supplying index sequences (not sure how the /2;1 came into existence.

You can try demuxbyname.sh from BBMap suite.

ADD COMMENT • link 6.6 years ago by GenoMax 141k

0

Entering edit mode

The tool seems perfect for the job, but for some reason my barcodes do not match the reads and produce 0 size files. Can you tell me what am I doing wrong?

demuxbyname.sh in=s_5_#_sequence.fastq.gz out=sequence.test/out_%_#.fq prefixmode=f names=barcodes2.txt


head barcodes2.txt
TTGCTGGA-ACCAACTG
GTTCATTA-AGGTAAGG
GAGAGTTG-AGGTAAGG
CATCAGAC-AGGTAAGG


zcat * | head 

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

0

Entering edit mode

The problem seems to be related to this final "/1;1" or "/2;1", after removing it, the application works nicely.

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

0

Entering edit mode

Just a minor comment, after using a smallsampled.fastq, there was only 1 read match(including quality), but the summary reported 2 reads. Do you know if is this normal? Is the program considering the quality as 1 read?

Time:               0.215 seconds.
Reads Processed:    25  0.12k reads/sec
Bases Processed:    1000        0.00m bases/sec
Reads Out:    2
Bases Out:    80

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

0

Entering edit mode

Tagging: Brian Bushnell to see if he has any comments

ADD REPLY • link 6.6 years ago by GenoMax 141k

1

Entering edit mode

demuxbyname.sh in=file.fq prefixmode=f length=21 out=barcode_%.fq

That will demultiplex by the last 21 characters of the header without need to specify a barcode file. Alternatively:

demuxbyname.sh in=file.fq names=names.txt substringmode out=barcode_%.fq

That will allow the names to be substrings of the headers, so the trailing "/2;1" will be ignored.

ADD REPLY • link 6.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you very much Brian.

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

0

Entering edit mode

Oh, I missed your question:

Just a minor comment, after using a smallsampled.fastq, there was only 1 read match(including quality), but the summary reported 2 reads. Do you know if is this normal? Is the program considering the quality as 1 read?

The number of reads out should be correct, and the "+" line is ignored, only read headers (starting with @) are considered. Can you show the contents of the output matching reads, and the first 8 lines of the subsampled input file (I know you've posted parts of the file, I just want to make sure exacly what the contents were in this case), the command, and ideally the full screen output as well, just to see if the reads were detected as paired/interleaved or something like that.

ADD REPLY • link 6.6 years ago by Brian Bushnell 20k

0

Entering edit mode

#In theory 25 sequences \/

wc -l test.fq
100 test.fq

head -n8 test.fq

@GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
NGCATGAACGGCTAAACGAGGGTCCAACTGTCTCTTATCT
+GHAY-HISEQ2:5:2308:2003:1934#TTGCTGGA-ACCAACTG/1;1
B[[aaeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
@GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
NAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCT
+GHAY-HISEQ2:5:2308:2551:1934#CCTGGATA-TGCTCGAC/1;1
B[[[aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

 demuxbyname.sh in=test.fq names=barcodes.txt substringmode out=test/barcode_%.fq

Input is being processed as unpaired
Time:               1.013 seconds.
**Reads Processed:    25**  0.02k reads/sec
Bases Processed:    1000        0.00m bases/sec
**Reads Out:    48**
Bases Out:    1920

I think the reads out should be 50. But there was an unmatched:

@GHAY-HISEQ2:5:2308:1155:1935#NNNNNNNN-NNNNNNNN/1;0
NAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+GHAY-HISEQ2:5:2308:1155:1935#NNNNNNNN-NNNNNNNN/1;0
B[[[aBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

ADD REPLY • link 6.6 years ago by tiago211287 ★ 1.4k

1

Entering edit mode

After trimming the trailing /1;1 from the input first 8 lines, I got this:

demuxbyname.sh in=x.fq out=y%.fq suffixmode length=21
java -ea -Xmx1200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.DemuxByName in=x.fq out=y%.fq suffixmode length=21
Executing jgi.DemuxByName [in=x.fq, out=y%.fq, suffixmode, length=21]

Changed from ASCII-33 to ASCII-64 on input quality 66 for base N while prescanning.
Input is being processed as unpaired
Time:               0.060 seconds.
Reads Processed:    2   0.03k reads/sec
Bases Processed:    80  0.00m bases/sec
Reads Out:    2
Bases Out:    80

...which is the output I expect. But in substring mode it looks like the output is correct, however everything is counted twice so the counts are wrong. Thanks for bringing that to my attention; I'll fix it soon.