Hi all,
I will be using Drop-seq to prepare cDNA libraries from thousands of single cells, followed by single-end sequencing on a HiSeq 2500 (read length = 100 bases). This involves adding a unique 12-nucleotide cell barcode to the 5' end of all reads originating from the same cell (so bases 1-12 of each read). Unique molecular identifiers (UMIs) will also be used; they will be bases 13-20 of the reads. I won't be able to use the pipeline designed by the creators of Drop-seq because it requires paired-end sequencing. My problem is with demultiplexing: I am using randomly generated cell barcodes supplied in excess, so I have no way of knowing beforehand which barcodes were actually used. Because of this, I cannot supply the barcode sequence list that most scripts out there require as input. As someone with minimal computational/bioinformatics skills, I am not comfortable writing custom scripts to fit my needs.
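To make the read layout concrete, here is a minimal Python sketch of how the structure described above slices apart (the read sequence itself is made up for illustration; the positions are the ones stated in the post):

```python
# Read layout assumed in this thread: bases 1-12 = cell barcode,
# bases 13-20 = UMI, remainder = cDNA sequence.
read = "ACGTACGTACGTTTTTAAAACCCGGGTTTACG"  # hypothetical 32-base read

cell_barcode = read[0:12]   # bases 1-12
umi          = read[12:20]  # bases 13-20
cdna         = read[20:]    # bases 21 onward

print(cell_barcode, umi, cdna)
```

Note that after removing barcode and UMI, only the remaining bases are available for alignment, which matters later in this thread.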
Does anyone know of any scripts/packages that I could use to demultiplex my data based on the in-line cell barcodes? I am planning on using the Python package UMI-tools (https://github.com/CGATOxford/UMI-tools) to extract the UMIs and deduplicate reads, but I have not been able to find any information on how to separate the reads from different cells into distinct fastq files.
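In case it helps others with the same question: the "unknown barcodes" problem can be handled with a two-pass approach, since real cell barcodes recur across many reads while error barcodes are rare. This is a hedged sketch of that idea, not any published tool; the `min_reads` cutoff, file naming, and `demultiplex` function are all illustrative choices of my own:

```python
# Sketch: demultiplex a FASTQ by its 12-base in-line cell barcode
# without a known barcode list. Pass 1 tallies barcode frequencies;
# pass 2 writes reads whose barcode is sufficiently abundant into one
# FASTQ file per barcode. min_reads is an assumed, tunable cutoff.
from collections import Counter
from pathlib import Path

def read_fastq(path):
    """Yield (header, seq, plus, qual) records from an uncompressed FASTQ."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

def demultiplex(fastq_path, out_dir, bc_len=12, min_reads=2):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pass 1: count how often each candidate barcode occurs.
    counts = Counter(seq[:bc_len] for _, seq, _, _ in read_fastq(fastq_path))
    keep = {bc for bc, n in counts.items() if n >= min_reads}
    # Pass 2: write reads for retained barcodes to separate files.
    handles = {}
    try:
        for header, seq, plus, qual in read_fastq(fastq_path):
            bc = seq[:bc_len]
            if bc not in keep:
                continue  # likely a sequencing-error barcode
            if bc not in handles:
                handles[bc] = open(out_dir / f"{bc}.fastq", "w")
            handles[bc].write(f"{header}\n{seq}\n{plus}\n{qual}\n")
    finally:
        for fh in handles.values():
            fh.close()
    return keep
```

On real data you would want a more principled abundance cutoff (e.g. a knee in the barcode-frequency curve) and some tolerance for barcodes one mismatch away from an abundant one, but the skeleton is the same.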
Thanks in advance for any recommendations!
Hi Igor, your zcat solution appears to have done the trick. I used it to generate a barcode list, which I then used as the barcode input file for fastx_barcode_splitter (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html). I now have demultiplexed fastq files. One strange thing, though: when I aligned the fastqs with bowtie2, 100% of reads aligned more than once. The reads I'm using at the moment are very short (12 bases after trimming off barcodes/UMIs; they were designed with a different analysis pipeline in mind), so I wonder if that might have something to do with the strange alignments. Anyway, I think once I generate my own data I will be able to troubleshoot the issue better.
It's not strange. It's very hard to get a unique alignment with 12 bases. 100% multi-mapping seems a bit excessive, but I wouldn't have expected less than 90%. I don't think you can expect a high fraction of unique alignments below 25 bases.
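A back-of-the-envelope calculation makes this concrete. Assuming a roughly 3 Gb genome and uniform base composition (both simplifications), the expected number of exact matches for a random k-mer is genome size divided by 4^k:

```python
# Expected occurrences of a random k-mer in a genome of size G,
# assuming uniform base composition and counting one strand only
# (both simplifying assumptions).
G = 3.0e9  # approximate human genome size in bases

def expected_hits(k, genome_size=G):
    return genome_size / 4 ** k

print(expected_hits(12))  # 12-mer: ~180 expected matches
print(expected_hits(25))  # 25-mer: far below 1, effectively unique
```

So a 12-base read is expected to match the genome at over a hundred places by chance alone, which is why essentially everything multi-maps, while by ~25 bases the expected number of chance matches drops well below one.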
See earlier discussion here: Length Of Read Needed To Confidently Map Sequence