I have a FASTQ file with records like the one below. From the header line, I want to extract the first 8 letters after 1:N:0: (2:N:0: in the example below), the 8 letters before the "+" sign, the 8 letters after the "+" sign, and the last 8 letters of the line. Then I need to concatenate these 8+8+8+8 = 32 letters and insert them right after the "@" on the same line, followed by a ":". Can anyone suggest how to do this? I plan to use bash, Python, or R. Thanks!
Original:
@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121
Results:
@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121
Python would be the most appropriate of the three tools you listed. Read in four lines at a time (one FASTQ record) and process them as a group.
I have no idea why you're using such malformed fastq files or what this has to do with ATAC-seq.
Hi Devon,
Thanks for your advice. I'm running a single-cell ATAC-seq analysis, and after the bcl2fq step the output FASTQ files carry the Tn5 and PCR barcodes in the header, as shown in my post. So I'm looking for a way to extract the barcodes and concatenate them in the read name.
Here's an example Python script for finding and moving barcodes. You can adapt it to your needs (though have a look at whether UMI-tools can do what you want).
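As a rough sketch of what such a script could look like (not necessarily the one referred to above), the following assumes four-line FASTQ records whose header has the form "@<read id> <read>:<filter>:<control>:<index1>+<index2>", as in the example in the question. The names HEADER_RE, move_barcodes and convert are made up for illustration, and the index string is dropped from the end of the header to match the "Results" example.

```python
import re

# Assumed header layout (taken from the example in the question):
#   @<read id> <read>:<filter>:<control>:<index1>+<index2>
HEADER_RE = re.compile(r"^@(\S+) (\d+:[YN]:\d+):([ACGTN]+)\+([ACGTN]+)$")


def move_barcodes(header, n=8):
    """Return the header with the 4*n barcode letters moved to the front.

    The barcode is the first and last n letters of each index,
    i.e. the parts on either side of the "+" sign.
    """
    match = HEADER_RE.match(header.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"unexpected header: {header!r}")
    read_id, flags, index1, index2 = match.groups()
    if len(index1) < n or len(index2) < n:
        raise ValueError(f"index shorter than {n} letters: {header!r}")
    barcode = index1[:n] + index1[-n:] + index2[:n] + index2[-n:]
    return f"@{barcode}:{read_id} {flags}"


def convert(in_handle, out_handle, n=8):
    """Rewrite the header of every four-line record read from in_handle."""
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        # Sequence, "+" separator and quality lines are copied unchanged.
        rest = [in_handle.readline() for _ in range(3)]
        out_handle.write(move_barcodes(header, n) + "\n")
        out_handle.writelines(rest)
```

To use it, open the input and output files and pass the two handles to convert(); for gzipped FASTQ, gzip.open(path, "rt") and gzip.open(path, "wt") should work in the same way. Headers that do not match the assumed layout raise a ValueError rather than being passed through silently.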
Thanks! I solved the problem!