Recently, I got a whole lane of NGS data. However, this lane contains two different barcodes: GCCAAT and GGCTAC. I want to split the reads into two samples according to their barcodes. The read headers in my data look like this:
@ST-E00159:619:HYT2MCCXY:2:1101:5649:1502 2:N:0:NCCAAT
NATTAAGTCATTTATCCGAATTAGGAGGAAATAAATTTCTCAAGAAAACAAAGCATTTAATTGAGATCAGACAAAAATAGTGGAAAATAGAAAGTGGTCTGCTTAAATTATCAGACAAATGATAGAAAAGAATTCTACTAGCGAAATTTA
+
The header includes the barcode "NCCAAT", so my question is how to separate the reads into two samples. The barcodes in the headers do not exactly match my barcodes (GCCAAT or GGCTAC) because they contain several "N"s (one, two or three). I counted the most frequent barcodes in the headers:
158802022 GCCNAT
66757482 GNCANT
54440177 GCNANT
51615443 GCCANT
7732699 GNNANT
3237294 NCCNAT
3184852 GCCCAT
1717301 GNCCNT
1691456 TAGNTT
Could anyone give me some ideas for solving this problem?
Thanks.
Essentially, this is demultiplexing based on fastq headers,
Demultiplexing Fastq based on barcodes on identifier line
Demultiplexing reads with index present in the labels
plus allowing mismatches,
Demultiplexing fastq file on Identifier line allowing mismatch
Eventually, you will probably have to make a list of 7 sequences per barcode, the first being the original barcode and the other 6 having one N at any of the 6 nucleotide positions, and then demultiplex with this list as suggested in the threads above.
Thanks for your help. I have carefully checked those posts, but they do not suit my problem. Most of the solutions, such as demuxbyname.sh, only handle barcodes without "N", while another seems to handle only barcodes with one "N". In my case, the barcodes contain several "N"s (<=3). Is there any other convenient way to solve this problem?
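One way to handle several "N"s without listing anything is to treat each "N" in the header as a wildcard and compare the remaining positions against both barcodes. The following is only a minimal sketch of that idea: the sample names and the function names are made up for illustration, and it assumes the barcode is the text after the last ":" in the header, as in the record shown in the question. Barcodes that fit both samples or neither are left unassigned.

```python
# Hypothetical sample names; the two barcodes are the ones from the question.
BARCODES = {"sample1": "GCCAAT", "sample2": "GGCTAC"}


def matches(observed, expected):
    """True if observed equals expected at every position that is not 'N'."""
    if len(observed) != len(expected):
        return False
    return all(o == "N" or o == e for o, e in zip(observed, expected))


def assign(header):
    """Return the sample name for a FASTQ header line, or None if unclear.

    Assumes the barcode is the field after the last ':' in the header.
    """
    observed = header.rsplit(":", 1)[-1].strip()
    hits = [name for name, bc in BARCODES.items() if matches(observed, bc)]
    # Keep the read only when exactly one barcode is compatible.
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    print(assign("@ST-E00159:619:HYT2MCCXY:2:1101:5649:1502 2:N:0:NCCAAT"))
```

Note that this only tolerates "N"; a barcode such as GCCCAT, which has a real mismatch, would stay unassigned unless mismatches are allowed as well.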
Note that the two barcodes are identical at bases 1, 3 and 5 and differ only at bases 2, 4 and 6, so once you ignore bases 1, 3 and 5 of the index, exhaustively listing every acceptable barcode for each sample is not too hard.
If you have >1 N, make lists with all possible combinations.
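Such lists do not have to be typed by hand. Here is a rough sketch that generates every variant of a barcode with up to three "N"s and drops the variants that would fit both samples. The sample names, the function name and the max_n parameter are assumptions for illustration; the resulting lists could then be passed to whichever demultiplexing tool you use.

```python
from itertools import combinations

# Hypothetical sample names; the two barcodes are the ones from the question.
BARCODES = {"sample1": "GCCAAT", "sample2": "GGCTAC"}


def masked_variants(barcode, max_n=3):
    """Return all versions of barcode with 0..max_n positions replaced by 'N'."""
    variants = set()
    for k in range(max_n + 1):
        for positions in combinations(range(len(barcode)), k):
            chars = list(barcode)
            for p in positions:
                chars[p] = "N"
            variants.add("".join(chars))
    return variants


lists = {name: masked_variants(bc) for name, bc in BARCODES.items()}

# Variants that appear in both lists cannot be assigned to either sample.
ambiguous = lists["sample1"] & lists["sample2"]

# Map each unambiguous variant to its sample.
lookup = {
    variant: name
    for name, variants in lists.items()
    for variant in variants
    if variant not in ambiguous
}

if __name__ == "__main__":
    for name, variants in lists.items():
        print(name, len(variants - ambiguous), "usable variants")
    print("ambiguous:", sorted(ambiguous))
```

With up to three "N"s in a 6-base barcode there are 1 + 6 + 15 + 20 = 42 variants per barcode, and the only one shared by both samples is the one where bases 2, 4 and 6 are all "N".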
Thanks, I will try as you advise.