How to split FASTQ reads by barcode when the barcodes include "N"
4.9 years ago
Garen • 0

Recently, I got a whole lane of NGS sequencing data. This lane contains two different barcodes: GCCAAT and GGCTAC. I want to split the reads into two samples according to their barcodes. The read headers in my data look like this:

@ST-E00159:619:HYT2MCCXY:2:1101:5649:1502 2:N:0:NCCAAT

NATTAAGTCATTTATCCGAATTAGGAGGAAATAAATTTCTCAAGAAAACAAAGCATTTAATTGAGATCAGACAAAAATAGTGGAAAATAGAAAGTGGTCTGCTTAAATTATCAGACAAATGATAGAAAAGAATTCTACTAGCGAAATTTA

+

The header contains the barcode "NCCAAT", so my question is how to separate these reads into two samples. The barcodes in the headers do not exactly match my barcodes (GCCAAT or GGCTAC) because they contain "N" at one, two, or three positions. I have counted the most frequent barcodes in the headers:

158802022 GCCNAT

66757482 GNCANT

54440177 GCNANT

51615443 GCCANT

7732699 GNNANT

3237294 NCCNAT

3184852 GCCCAT

1717301 GNCCNT

1691456 TAGNTT

Could anyone give me some idea to solve this problem?

Thanks.

demultiplexing • 3.0k views

Essentially, this is demultiplexing based on fastq headers,

Demultiplexing Fastq based on barcodes on identifier line

Demultiplexing reads with index present in the labels

plus allowing mismatches,

Demultiplexing fastq file on Identifier line allowing mismatch

Eventually, you probably will have to make a list of 7 sequences per barcode with the first being the original barcode and the other 6 having one N at any of the 6 nucleotide positions, followed by demultiplexing with this list as suggested in the threads above.
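The list of 7 sequences per barcode described above could be generated along these lines. This is only a minimal sketch: the function name `single_n_variants` and the sample names are placeholders, not part of any existing tool.

```python
def single_n_variants(barcode):
    """Return the original barcode followed by every variant with one base replaced by N."""
    variants = [barcode]
    for i in range(len(barcode)):
        variants.append(barcode[:i] + "N" + barcode[i + 1:])
    return variants


# Sample names are hypothetical; the barcodes are the two from the question.
barcodes = {"sample1": "GCCAAT", "sample2": "GGCTAC"}

for sample, barcode in barcodes.items():
    # One line per sample: the sample name, then its 7 accepted index sequences.
    print(sample, " ".join(single_n_variants(barcode)))
```

The resulting lists can then be given to whichever demultiplexing approach from the linked threads you choose.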


Thanks for your help. I have carefully checked those posts; however, they do not fit my problem. Most solutions, such as demuxbyname.sh, only handle barcodes without "N", while another solution seems to handle only barcodes with a single "N". In my case, the barcodes contain several "N" (<=3). Is there another convenient way to solve this problem?


Note that once you ignore bases 1, 3 and 5 of the index (which are identical in GCCAAT and GGCTAC), exhaustively listing every acceptable barcode for each sample is not too hard.
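As a rough illustration of this idea, the sketch below compares only the positions where the two barcodes differ (bases 2, 4 and 6) and treats "N" as a wildcard. The helper names are made up for this example, and it assumes the index is the last colon-separated field of the header, as in the read shown in the question. Note that ignoring bases 1, 3 and 5 also means sequencing errors at those positions go unnoticed.

```python
def distinguishing_positions(barcodes):
    """0-based positions at which the barcodes are not all identical."""
    return [i for i in range(len(barcodes[0]))
            if len({bc[i] for bc in barcodes}) > 1]


def index_from_header(header):
    """Take the index as the text after the last ':' of the header line (assumed layout)."""
    return header.rstrip().rsplit(":", 1)[-1]


def assign(index, barcodes, positions):
    """Return the only barcode compatible with the index, or None.

    Only the given positions are compared, and an N in the index matches any base.
    """
    if len(index) != len(barcodes[0]):
        return None
    hits = [bc for bc in barcodes
            if all(index[i] in ("N", bc[i]) for i in positions)]
    return hits[0] if len(hits) == 1 else None


barcodes = ["GCCAAT", "GGCTAC"]
positions = distinguishing_positions(barcodes)  # 0-based, i.e. bases 2, 4 and 6

header = "@ST-E00159:619:HYT2MCCXY:2:1101:5649:1502 2:N:0:NCCAAT"
print(assign(index_from_header(header), barcodes, positions))
```

An index that is compatible with both barcodes (all three distinguishing bases are N) or with neither returns `None`, so such reads can be written to an "undetermined" file rather than guessed.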


Eventually, you probably will have to make a list of 7 sequences per barcode with the first being the original barcode and the other 6 having one N at any of the 6 nucleotide positions, followed by demultiplexing with this list as suggested in the threads above.

If you have >1 N, make lists with all possible combinations.
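For more than one N, the lists of all combinations could be built as sketched below. The names `n_variants` and `build_lookup`, the sample names, and the limit of 3 N's (taken from the question) are assumptions for illustration. Variants that would fit more than one sample are set aside instead of being assigned.

```python
from itertools import combinations


def n_variants(barcode, max_n=3):
    """All versions of the barcode with 0..max_n positions replaced by N."""
    variants = set()
    for k in range(max_n + 1):
        for masked in combinations(range(len(barcode)), k):
            chars = list(barcode)
            for i in masked:
                chars[i] = "N"
            variants.add("".join(chars))
    return variants


def build_lookup(barcodes, max_n=3):
    """Map each unambiguous variant to its sample; return ambiguous variants separately."""
    lookup, ambiguous = {}, set()
    for sample, barcode in barcodes.items():
        for variant in n_variants(barcode, max_n):
            if variant in lookup and lookup[variant] != sample:
                ambiguous.add(variant)
            lookup[variant] = sample
    for variant in ambiguous:
        del lookup[variant]
    return lookup, ambiguous


# Sample names are hypothetical; the barcodes are the two from the question.
barcodes = {"sample1": "GCCAAT", "sample2": "GGCTAC"}
lookup, ambiguous = build_lookup(barcodes, max_n=3)

print(len(n_variants("GCCAAT")))  # 1 + 6 + 15 + 20 = 42 variants per barcode
print(lookup.get("GNNANT"))       # one of the indexes counted in the question
print(sorted(ambiguous))          # variants that fit both barcodes
```

With these two barcodes, the only ambiguous variant is the one where all three differing bases are masked (GNCNAN), so a dictionary lookup on the header index is enough to assign the remaining reads.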


Thanks, I will try as you advise.
