Question

Split fastq according to barcodes

0

Entering edit mode

5.3 years ago

Hughie ▴ 30

Hello, everyone:

I'm recently analyze my scRNA-seq data, the first step is to splitting fastq files according to my barcode file which looks like this:

sc1 AACGTGAT
sc2 AAACATCG
sc3 ATGCCTAA
sc4 AGTGGTCA
sc5 ACCACTGT
sc6 ACATTGGC
sc7 CAGATCTG
sc8 CATCAAGT
sc9 CGCTGATC
sc10    ACAAGCTA
sc11    CTGTAGCC
sc12    AACGCTTA

My data is pair end sequenced and the R1, R2 are like these (I trimmed some):
R2:

@ST-E00493:75:H33JKALXX:1:1101:10987:2206 2:N:0:ATACACAT    
AACGCTTAAGGGTAATTTTTTGTGTTATGTATTTTTTTTTTAGGGGAAAAGGCATTTTTGGT...
+
AAFFFFJJ<A7JF<JF----AA--A--7----AAFJ-F<-FF-<<F-<-AFFA-7A7A-A-<...

R1:

@ST-E00493:75:H33JKALXX:1:1101:10987:2206 1:N:0:ATACACAT
GTTGTGAAGGGGAGGCTGGAGAGGCTTCGTCTGCTAAGAGCATTGGCCGTTCTTCCACTGTT...
+
AAAFFFJ-<JJJJJJJJFJJJF7JFFJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJFFJJJ...

The barcode information is in the first 8bp of R2 (Here is AACGCTTA), so, I want to split the fastq file according to the barcode informations and pair the read_1 to read_2 by header info. But after I searched many programes or scripts I can't find a suitable solution:
fastq-multx

fastq-multx -B barcode_sequence -b -m 0 R2.fastq.gz R1.fastq.gz -o %_R1.fq -o %_R2.fq
The result is absolutely not what I want which only 7 lines head with its barcode.

fastx_barcode_splitter.pl It seems don't spport PE reads.

BBmap

I also wrote a python script, but it runs so slow.... , So, I wonder if somebody have good suggestions. Thanks in advance!

fastq demultiplexing scRNA-seq • 12k views

ADD COMMENT • link updated 2.2 years ago by camelest ▴ 50 • written 5.3 years ago by Hughie ▴ 30

1

Entering edit mode

Isn't the barcode the last part of the header, in your example ATACACAC and ATACACAT?

If so demuxbyname.sh should help.

fin swimmer

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

Entering edit mode

No, it's in the head 8bp of read_2

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

1

Entering edit mode

Index sequences are in both R1/R2 headers (see example you posted above). As suggested by @finswimmer demuxbyname.sh should indeed work in this case.

ADD REPLY • link 5.3 years ago by GenoMax 141k

0

Entering edit mode

Thanks for your reply, genomax , but this protocol is modified, so the barcode is in the head 8bp of Read_2. At the very beginning when I recieved the data, I also thought the barcode is in the header of both reads, so weird a protocol :(

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

1

Entering edit mode

use fastx tool kit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage

ADD REPLY • link 5.3 years ago by cpad0112 21k

0

Entering edit mode

Thanks cpad0112 for your reply, I have tried this method before but it can only handle SE data, and I have to get a script for pairing read1 and read2.

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

0

Entering edit mode

you may want to look at this python library ( I haven't used it): https://pypi.org/project/barcode-splitter/

ADD REPLY • link 5.3 years ago by cpad0112 21k

0

Entering edit mode

I also wrote a python script, but it runs so slow

show us the code

ADD REPLY • link 5.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

The first version I wrote is too slow, and I found a python script paired-end_barcode_splitter from github. It uses a perl script named "fastx_barcode_splitter.pl" from FASTX-toolkit to split my read_2 data according to barcode information and after that, fastq header was used to match the paired reads. But It's only litter faster and it takes about 10h to run my 40GB PE gzipped data. so, I wonder if anyone has a better solution, thanks a lot.

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

0

Entering edit mode

Thanks for the clarification, @RamRS

ADD REPLY • link 3.7 years ago by frederick.boehm • 0

0

Entering edit mode

Hi, it looks like this post is the right place where to find a solution to my problem. I am trying to do the same Houyu needed, with the difference that in my FASTQ files the barcodes are in the first 16nt of the second line in the R1 files. Thus, I have inverted the order of the R1/R2 files in fin swimmer's script and modified 8>16. However, only the last cell R1/R2 files are saved (i.e.there are 1503 cells and only SC1503_1/2.fastq files are saved), any help please?

paste <(zcat 10X_PBMC_R2.fastq.gz|paste - - - -) <(zcat 10X_PBMC_R1.fastq.gz|paste - - - - )|awk -v FS="\t" -v OFS="\n" 'FNR==NR {samples[$2]=$1; next} {barcode = substr($6,0,16); if(samples[barcode]) { print $1,$2,$3,$4>>(samples[barcode]"_2.fastq"); print $5,$6,$7,$8>>(samples[barcode]"_1.fastq")}}' samples.txt -

Thanks a lot in advance!

SV

ADD REPLY • link updated 3.6 years ago by GenoMax 141k • written 3.6 years ago by SV • 0

0

Entering edit mode

Hi, did you solve the problem? I have the same problem, the output only contains the last barcode.

ADD REPLY • link 2.2 years ago by camelest ▴ 50

0

Entering edit mode

Have you solved your problem? I met the same problem as you. I would like to ask you how to solve it

ADD REPLY • link 2.7 years ago by 17318598206 ▴ 20

0

Entering edit mode

Please see the solution by finswimmer below.

ADD REPLY • link 2.7 years ago by Hughie ▴ 30

score 5 · Accepted Answer · 2018-12-16

5

Entering edit mode

5.3 years ago

finswimmer 16k

Here a unix commandline only version:

$ paste <(zcat input_1.fastq.gz|paste - - - -) <(zcat input_2.fastq.gz|paste - - - - )|awk -v FS="\t" -v OFS="\n" 'FNR==NR {samples[$2]=$1; next} {barcode = substr($6,0,8); if(samples[barcode]) { print $1,$2,$3,$4>>samples[barcode]"_1.fastq"; print $5,$6,$7,$8>>samples[barcode]"_2.fastq"}}' samples.txt -

First we put the 4 lines, that belong to one read into one line
Then we put these 4 columns per read from the read1 file side-by-side with the 4 columns of the read2 file. The result is, that we have 8 columns per read pairs which are piped to awk
awk first reads in the file with the barcodes and names
After that it have a look at the first 8 characters of column 6 (which is the sequence of R2). If this is a barcode of a sample, we print the first 4 column to a file for read1 and the last 4 columns for read2. The delimiter is now converted back to new line.

You may want to compress the fastq files afterwards:

$ parallel 'bgzip {}' ::: *.fastq

fin swimmer

ADD COMMENT • link 5.3 years ago by finswimmer 16k

1

Entering edit mode

To run this command line on the macOs terminal, the output file names need to be parenthesized or awk gives an error

(samples[barcode]"_1.fastq") and (samples[barcode]"_2.fastq")

ADD REPLY • link 3.8 years ago by tomsauv ▴ 10

0

Entering edit mode

BTW, if you're on macOS, you should really be using homebrew to install GNU coreutils, as macOS defaullt utilities are inconsistent and lack in features (plus, require special changes like the one you made above to make commands work like they do in most other linux distros). See: https://www.topbug.net/blog/2013/04/14/install-and-use-gnu-command-line-tools-in-mac-os-x/

ADD REPLY • link 3.8 years ago by Ram 43k

0

Entering edit mode

Thanks a lot for your reply fin swimmer! This solution is so amazing, and it's what I'm looking for. I have tested it on my server and I will report the running time later. By the way, linux command is so powerful and also complicated which I have a long way to go...

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

0

Entering edit mode

It's 10 times faster than my previous scripts and it only toke ~1h to split my 40G gzipped PE data! So, amazing! Thank you again :)

ADD REPLY • link 5.3 years ago by Hughie ▴ 30

0

Entering edit mode

Can you offer a little more explanation of the awk call? I'm not yet fluent in its syntax. It seems that the code:

samples[$2]=$1; next

is defining the object "samples", but I can't discern the details of how "samples" is defined. What exactly are $2 and $1 in the above?

Also, why do you need to use both "samples.txt" and "-", as at the end of the above line?

I ask because I wish to modify your code to achieve a similar goal. I appreciate any clarifications that you might offer.

ADD REPLY • link 3.7 years ago by frederick.boehm • 0

3

Entering edit mode

Those conventions are specific to awk and shell programming. awk splits each input record (line) by a "field/column separator" (comma for CSV, tabs for tab-delimited files etc.) and then allows accessing each field using its position. $1 is the first field/column, $2 is the second field/column, and so on.

samples[$2]=$1

creates an array named samples. I'm guessing it's an associative array/hash/dictionary with the second column as keys and first column as values.

The description states quite clearly that 8 columns are passed to awk - the first two columns come from samples.txt and the remaining from the zcat | paste operations, which are passed to awk using the - parameter. The - parameter is a placeholder to let the shell know where to pass the output from the previous pipe as input to the current command.

ADD REPLY • link 3.7 years ago by Ram 43k

0

Entering edit mode

Thank you, @RamRS, for the helpful comments!

ADD REPLY • link 3.7 years ago by frederick.boehm • 0

0

Entering edit mode

I think it's also worth noting that

samples[$2]=$1; next

only operates on the first of the two input files, ie, samples.txt. Thus, $2 refers to the second column of samples.txt while $1 is the first column of samples.txt.

ADD REPLY • link 3.7 years ago by frederick.boehm • 0

1

Entering edit mode

only operates on the first of the two input files

That is not accurate. It operates on all input data, it's just that these 2 columns come from the first input file.

ADD REPLY • link 3.7 years ago by Ram 43k