Tool: SPLiT-Seq Demultiplexing, a bash tool for extracting barcoded single cells
5.6 years ago
paulranum11 ▴ 80

Hi Everyone,

I am interested in the recently published SPLiT-Seq single-cell/nucleus RNA-Seq paper from Rosenberg et al. that came out in Science: http://science.sciencemag.org/content/early/2018/03/14/science.aam8999

In this paper the authors describe a custom iterative barcoding strategy to index single cells for single-cell RNA-Seq without droplets or cell isolation. They have archived their data on GEO (GSE110823). However, I was unable to find any scripts or bioinformatics tools that they used to process the raw data.

I have attempted to throw together a SPLiT-Seq demultiplexing strategy using bash. With the goal of making it more robust, accurate, and functional, I wanted to share it with the bioinformatics community. Any comments, criticisms, bug reports, or pointers to any official SPLiT-Seq tools are much appreciated.

Here is a link to the GitHub repository: https://github.com/paulranum11/splitseq_demultiplexing

RNA-Seq sequencing

To flesh this out into a fully fledged tool, I would suggest showing some runtime comparisons for different data sizes, etc.


The GitHub link appears to be incorrect.


Thanks, link updated.


Is your script generally applicable/easily extensible to other SPLiT-Seq datasets, or is it specific to the one you mention above?


It should be directly applicable to any SPLiT-Seq dataset that uses the published indexes. If new indexes are used, a user can simply change the barcodes.txt files to reflect their index sequences.


Thanks! You may want to mention that in the original post and on GitHub.


Can you describe how the function of this differs from standard bcl2fastq? From the looks of your script you are starting from .fastq files (bcl2fastq starts from .bcl), so I'm not sure how much divergence in functionality there is after that step.


I am not an expert on bcl2fastq, but I understand that it is used as the first demultiplexing step after Illumina sequencing: it takes the .bcl files generated by the sequencer, then both demultiplexes them and converts them to standard .fastq format for each indexed sample. To do this, bcl2fastq is capable of extracting samples indexed with standard Illumina barcodes from one of their index kits. Because SPLiT-Seq uses a custom three-stage ligation-based barcoding design in addition to standard Illumina indexes, I assume that bcl2fastq is not capable of demultiplexing the single cells based on this non-Illumina indexing strategy. As an example, the archived SPLiT-Seq datasets on GEO appear to have had the Illumina indexes removed, while the SPLiT-Seq barcodes 1-3 are still there.

I am sure there are similarities, as both tools try to demultiplex samples... so I would be very interested to know if bcl2fastq could be made to work on SPLiT-Seq data. Perhaps other people here have more insight.


Hello, firstly I would like to thank you for your work. But I have some doubts about the Round*_barcodes_new.txt files. I am curious why the Round1 barcodes are 13 bases long, the Round2 barcodes 13 bases, and the Round3 barcodes 14 or 15 bases. We know that the unique Round1 barcode is 8 bases, yet you add 5 extra bases that are identical across all the barcodes.


The 8-base barcode sequences used in the SPLiT-Seq protocol are reused between the different barcoding rounds. You can confirm this here: https://drive.google.com/file/d/1l3MkxYVWOPTV5KPpq2uXMLzKOKwVeZyw/view?usp=sharing

The barcodes can be reused because the flanking sequences used for the ligation are distinct for each round. As a result, the position of each barcode and its flanking sequence in the resulting read is unique. The several bases of flanking sequence added to each 8-base barcode are included in the "barcodes_new3.txt" files to allow the script to detect a match unique to the intended barcode position. For example, a Round2 barcode search will not detect a barcode added during Round 1 or Round 3, because those lack the Round2 flanking sequence. If you look at an example read sequence and the barcode sequences linked above, this logic will become clear.
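As a purely illustrative example of that logic (the sequences below are made-up placeholders, not the published SPLiT-Seq oligos, and this is not the actual script), searching for the barcode together with its round-specific flanking bases is what anchors the match to one round:

    #!/bin/bash
    # Illustration only: hypothetical barcode and flank, not real SPLiT-Seq oligos.
    barcode="AACGTGAT"        # 8-base barcode that may be reused in every round
    round2_flank="GTGGCCGA"   # flanking bases that only the Round 2 ligation adds

    # Matching barcode+flank as one fixed string only hits reads where this
    # barcode was added in Round 2; the same 8 bases ligated in Round 1 or 3
    # sit next to a different flanking sequence and are not matched.
    grep -F "${barcode}${round2_flank}" reads.fastq > round2_AACGTGAT_hits.txt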


Thanks a lot for your meticulous reply. Now I have noticed that the full dataset used for benchmarking contains only 131 nuclei (according to the paper), while your script generated 14,423 .fastq files (cells) as output. I am new to bioinformatics and hope my questions do not bother you too much.


@rainwashautumn I have also noticed this discrepancy with GSM3017260 (SRR6750041). I emailed the authors of the paper months ago about this point but never received a response. I think the issue may be one of two things: (1) the dataset is mislabeled in the GEO archive, or (2) they performed additional filtering to somehow arrive at 131 nuclei, eliminating a very large number of barcode combinations. From the testing we have done on "splitseqdemultiplexing.sh", we know that it performs as expected on datasets containing a computationally predefined number of barcode combinations. We also have confidence that each of the barcode combinations identified exists in the .fastq file downloaded from SRR6750041, because "splitseqdemultiplexing.sh" records each combination found and we can confirm its presence in the raw reads.

Finally, we seem to get results much closer to those reported in the SPLiT-Seq paper if we process other files, for example GSM3017260 (SRR6750042). Anyway, you point out a very interesting discrepancy that I have not been able to wrap my head around. If you figure it out, please share your findings with us here.


HUGE UPDATE to SPLiT-Seq_Demultiplexing released on Jan 16, 2019. An enormous increase in speed was achieved by rewriting steps 1 and 2 in Python to make use of hash data structures. Support for a user-defined number of mismatches was also added. Suggestions, comments, and criticisms are welcome.
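A rough sketch of what that looks like (the actual update is in Python; this is only the same hashing idea expressed in bash 4, with an assumed barcode file name and format): each barcode is pre-expanded into its one-mismatch variants and stored in an associative array, so matching a candidate sequence from a read becomes a single hash lookup instead of a scan over every barcode.

    #!/bin/bash
    # Sketch of the hash-lookup idea only; not the released Python code.
    declare -A barcode_lookup
    bases="A C G T N"

    # Pre-expand every barcode into itself plus all single-mismatch variants.
    # (If two barcodes are within two mismatches of each other, the later one
    # silently wins a colliding variant; good enough for a sketch.)
    while read -r bc; do
        barcode_lookup["$bc"]="$bc"
        for ((i=0; i<${#bc}; i++)); do
            for b in $bases; do
                variant="${bc:0:i}${b}${bc:$((i+1))}"
                barcode_lookup["$variant"]="$bc"
            done
        done
    done < Round1_barcodes_new.txt   # assumed: one barcode sequence per line

    # A candidate pulled from a fixed position in a read resolves in one lookup.
    candidate="ACGTACGT"             # hypothetical sequence for illustration
    echo "${barcode_lookup[$candidate]:-no_match}"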

5.6 years ago
Ram 43k

Hi,

I have a few questions and some feedback on the code:

Questions:

  • What is a "round"?
  • Why are there 3 rounds?
  • Is that the typical use case, or can there be more (or fewer) "rounds"? (I see that the paper answers these questions)
  • Why use cut -b 1- to read the files into an array?

Code feedback:

  • I'd recommend switching to files passed as command-line arguments and processed using getopts, instead of hard-coding filenames in the scripts. That way, you can keep your input files wherever you like
  • Use set -eo pipefail so the script stops if any step fails
  • Ensure log and out files can be created (i.e. the directory has write permissions), or better yet, just write to STDOUT and STDERR as appropriate
  • Do not implicitly rm. Instead, create directories of the format prefix.XXX using mktemp [-d]. See here
  • There are a lot of loops and conditionals, maybe restructure the code so you need fewer loops? Better yet, switch to python. Shell loops are bad (Advanced topic)
  • Unless you're using regular expressions, grep -F (or grep -Fx if you're matching an entire line) will reduce the load on grep.
  • You're exporting and parallelizing the grep function; why not do that throughout the script instead of just in one place? (A small sketch pulling these suggestions together follows this list.)
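To make those points concrete, here is a minimal, hypothetical sketch (not the splitseq_demultiplexing code; the option names, file arguments, and match_barcode helper are invented for illustration) that combines getopts, set -eo pipefail, mktemp, grep -F, and an exported function run under GNU parallel:

    #!/bin/bash
    # Sketch only: hypothetical options and helper names, not the real script.
    set -eo pipefail                      # stop if any step in a pipeline fails

    usage() { echo "Usage: $0 -f reads.fastq -b barcodes.txt" >&2; exit 1; }

    # Take inputs as command-line arguments instead of hard-coded filenames.
    while getopts "f:b:" opt; do
        case "$opt" in
            f) fastq="$OPTARG" ;;
            b) barcodes="$OPTARG" ;;
            *) usage ;;
        esac
    done
    [[ -n "${fastq:-}" && -n "${barcodes:-}" ]] || usage

    # Work in a fresh temporary directory rather than rm-ing a fixed path.
    workdir=$(mktemp -d demux.XXXXXX)

    # Fixed-string matching is cheaper than regex matching for literal barcodes.
    match_barcode() {
        grep -F "$1" "$fastq" > "$workdir/$1.hits" || true   # no hits is not an error
    }
    export -f match_barcode
    export fastq workdir

    # Parallelize the per-barcode work across cores, one job per barcode line.
    parallel match_barcode :::: "$barcodes"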

Ram, thanks very much for your feedback. I appreciate it.

Explanation of "rounds": The SPLiT-Seq Science paper describes three successive "rounds" of barcoding. I am borrowing the round1, round2, round3 terminology from their publication. Each round of barcoding done in the wet lab is performed through the addition of a specific set of barcode oligos. The typical use case is three rounds of barcoding, which are used to achieve a high number of possible barcode combinations: Round1 (48) x Round2 (96) x Round3 (96) = 442,368 possible combinations.

Code feedback: Thanks for your detailed comments. I will work on revising the code.

One area that you identify for revision (avoiding shell loops) is an area I have looked at but struggled to revise, particularly in the step 1 (demultiplexing) part of the script. I created a Stack Overflow question on this topic here: https://stackoverflow.com/questions/52408588/parallelizing-nested-for-loop-with-gnu-parallel

Thanks for your suggestions!


And you got a reply from the author of GNU parallel ;)

Ole won’t lead you wrong!


The person on SO has parallelized the innermost loop. I'm guessing you could try applying that logic (abstracting the work into a function, using input files as arguments, and parallelizing the function) to the outer loops as well, although I'm not sure how the task distribution would work in such an implementation.
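For what it's worth, here is a rough sketch of that idea (assuming the Round*_barcodes_new.txt files hold one barcode per line, and with the real per-combination work left as a stub): GNU parallel will generate the full Round1 x Round2 x Round3 cross product itself when given several input sources, which removes the nested loops entirely.

    #!/bin/bash
    # Sketch only: the per-combination work is stubbed out.
    find_combination() {
        # $1, $2 and $3 are one barcode each from Rounds 1, 2 and 3; the real
        # work (writing the matching reads to a per-cell fastq) would go here.
        echo "processing combination $1 $2 $3"
    }
    export -f find_combination

    # Each ':::: file' adds another input source; parallel runs the function
    # once per combination and spreads the jobs across the available cores.
    parallel find_combination \
        :::: Round1_barcodes_new.txt \
        :::: Round2_barcodes_new.txt \
        :::: Round3_barcodes_new.txt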

