Question

How to split FASTQ reads by barcode when the barcodes in the headers contain "N"

4.9 years ago

Garen • 0

I recently received a whole lane of NGS sequencing data. The lane contains two different barcodes: GCCAAT and GGCTAC. I want to split the reads into two samples according to their barcode. However, the read headers look like this:

@ST-E00159:619:HYT2MCCXY:2:1101:5649:1502 2:N:0:NCCAAT

NATTAAGTCATTTATCCGAATTAGGAGGAAATAAATTTCTCAAGAAAACAAAGCATTTAATTGAGATCAGACAAAAATAGTGGAAAATAGAAAGTGGTCTGCTTAAATTATCAGACAAATGATAGAAAAGAATTCTACTAGCGAAATTTA

+

The header contains the barcode "NCCAAT", so my question is how to separate these reads into two samples. The barcodes in the headers do not exactly match my barcodes (GCCAAT or GGCTAC); they contain one, two, or three "N"s. I counted the most frequent barcodes in the headers:

158802022 GCCNAT

66757482 GNCANT

54440177 GCNANT

51615443 GCCANT

7732699 GNNANT

3237294 NCCNAT

3184852 GCCCAT

1717301 GNCCNT

1691456 TAGNTT
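A tally like the one above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming uncompressed four-line FASTQ records and Illumina-style headers in which the index is the text after the last colon. The function names `index_from_header` and `count_indexes` are hypothetical, chosen for illustration.

```python
from collections import Counter
from typing import Iterable


def index_from_header(header: str) -> str:
    """Return the index from a header such as '@... 2:N:0:NCCAAT'."""
    # Assumption: the index is whatever follows the last colon.
    return header.rstrip().rsplit(":", 1)[-1]


def count_indexes(lines: Iterable[str]) -> Counter:
    """Count index sequences, reading every 4th line (the header) of a FASTQ stream."""
    counts: Counter = Counter()
    for line_number, line in enumerate(lines):
        # Lines 0, 4, 8, ... are headers when records are exactly 4 lines long.
        if line_number % 4 == 0 and line.strip():
            counts[index_from_header(line)] += 1
    return counts
```

Passing an open file handle to `count_indexes` and calling `.most_common(10)` on the result would list the ten most frequent indexes.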

Could anyone give me some idea to solve this problem?

Thanks.

demultiplexing • 3.0k views

Essentially, this is demultiplexing based on fastq headers,

Demultiplexing Fastq based on barcodes on identifier line

Demultiplexing reads with index present in the labels

plus allowing mismatches,

Demultiplexing fastq file on Identifier line allowing mismatch
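For orientation, header-based demultiplexing can be sketched in plain Python as follows. This is not taken from the linked threads; it assumes uncompressed four-line FASTQ records with the index after the last colon of the header. `read_fastq`, `split_by_header`, and the `assign` callable (which maps an index to a sample name or `None`) are hypothetical names.

```python
import io
from typing import Callable, Dict, Iterator, List, Optional, TextIO


def read_fastq(handle: TextIO) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of 4 lines (assumes no line wrapping)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            return
        yield record


def split_by_header(
    handle: TextIO,
    assign: Callable[[str], Optional[str]],
    outputs: Dict[str, TextIO],
    unassigned: TextIO,
) -> Dict[Optional[str], int]:
    """Write each record to the output chosen by `assign`; return per-sample counts."""
    counts: Dict[Optional[str], int] = {}
    for record in read_fastq(handle):
        index = record[0].rstrip().rsplit(":", 1)[-1]
        sample = assign(index)
        # Records with no (or an unknown) sample go to the `unassigned` handle.
        outputs.get(sample, unassigned).writelines(record)
        counts[sample] = counts.get(sample, 0) + 1
    return counts


if __name__ == "__main__":
    fastq = io.StringIO(
        "@r1 2:N:0:GCCAAT\nACGT\n+\nIIII\n@r2 2:N:0:TAGNTT\nTTTT\n+\nIIII\n"
    )
    outs = {"sample1": io.StringIO()}
    leftover = io.StringIO()
    # Exact matching only here; a real lookup would also accept N-containing indexes.
    print(split_by_header(fastq, {"GCCAAT": "sample1"}.get, outs, leftover))
```

Keeping the matching rule in a separate `assign` callable means the same loop works whether the rule is an exact lookup table or a mismatch-tolerant comparison.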

Eventually, you probably will have to make a list of 7 sequences per barcode with the first being the original barcode and the other 6 having one N at any of the 6 nucleotide positions, followed by demultiplexing with this list as suggested in the threads above.

ADD REPLY • link 4.9 years ago by ATpoint 81k

Thanks for your help. I have carefully checked those posts, but they do not fit my problem. Most of the solutions, such as demuxbyname.sh, only handle barcodes without "N", and another solution seems to handle only barcodes with a single "N". In my case, the barcodes contain several "N"s (<=3). Is there another convenient way to solve this problem?

ADD REPLY • link 4.9 years ago by Garen • 0

Note that once you ignore bases 1,3 and 5 of the index, exhaustively listing every acceptable barcode for each sample is not too hard.
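The reason this works is that GCCAAT and GGCTAC carry the same base at positions 1, 3 and 5 (G, C and A), so only positions 2, 4 and 6 distinguish the two samples. A rough sketch of that idea, treating N as a wildcard, might look like the following; `informative_positions` and `assign` are hypothetical helper names, and positions are 0-based in the code.

```python
from typing import Dict, List, Optional


def informative_positions(barcodes: List[str]) -> List[int]:
    """Return 0-based positions at which the barcodes are not all identical."""
    return [i for i, bases in enumerate(zip(*barcodes)) if len(set(bases)) > 1]


def assign(index: str, samples: Dict[str, str], positions: List[int]) -> Optional[str]:
    """Return the only sample compatible with `index`, or None.

    Only `positions` are compared, and an N in the index matches any base.
    """
    hits = [
        name
        for name, barcode in samples.items()
        if all(index[i] in ("N", barcode[i]) for i in positions)
    ]
    # Zero hits: a mismatch at a compared position. Several hits: too many Ns to decide.
    return hits[0] if len(hits) == 1 else None
```

With these two barcodes, an index such as GNNANT is still assigned to the GCCAAT sample, while GNCNAN fits both and is left unassigned. Because positions 1, 3 and 5 are not checked, an index that is wrong only at those positions would still be accepted, which may or may not be desirable.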

ADD REPLY • link 4.9 years ago by swbarnes2 14k

If the barcodes contain more than one N, extend the list for each barcode to cover all possible combinations of N positions.
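Such lists do not have to be written by hand. The sketch below generates every variant of a barcode with up to a given number of bases replaced by N, then builds a lookup from variant to sample; `n_variants` and `build_lookup` are hypothetical names, and the sample names used with them are placeholders. Variants that fit more than one sample are dropped rather than guessed.

```python
from itertools import combinations
from typing import Dict, Set


def n_variants(barcode: str, max_n: int) -> Set[str]:
    """Return the barcode plus every version with up to `max_n` bases replaced by N."""
    variants = set()
    for n_count in range(max_n + 1):
        for masked in combinations(range(len(barcode)), n_count):
            variants.add(
                "".join("N" if i in masked else base for i, base in enumerate(barcode))
            )
    return variants


def build_lookup(samples: Dict[str, str], max_n: int) -> Dict[str, str]:
    """Map each unambiguous N-containing variant to its sample name."""
    lookup: Dict[str, str] = {}
    ambiguous = set()
    for name, barcode in samples.items():
        for variant in n_variants(barcode, max_n):
            if variant in lookup and lookup[variant] != name:
                ambiguous.add(variant)  # same variant arises from two samples
            lookup[variant] = name
    for variant in ambiguous:
        del lookup[variant]
    return lookup
```

For a 6-base barcode this gives 7 variants with at most one N and 42 with at most three. For GCCAAT and GGCTAC, the variant GNCNAN arises from both barcodes and is therefore excluded. Indexes with a real mismatch, such as GCCCAT, are not covered by this list and would need a separate mismatch rule.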

ADD REPLY • link 4.9 years ago by ATpoint 81k

Thanks, I will try as you advise.

ADD REPLY • link 4.9 years ago by Garen • 0