Extract substrings, combine them, and prepend the result to the same line
6.0 years ago

I have a file with sequences like the one below. From the header line, I want to extract the first 8 letters after 1:N:0: (2:N:0: in the example below), the 8 letters before the "+" sign, the 8 letters after the "+" sign, and the last 8 letters of the line. I then need to concatenate these 8+8+8+8 = 32 letters and insert them right after "@" on the same line, followed by ":". Can anyone give me a suggestion? I plan to use bash, Python, or R for this. Thanks!

Original:

@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Results:

@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Python would be the most appropriate of the three tools you listed. Read in the 4 lines of each record and process them as a group.

I have no idea why you're using such malformed fastq files or what this has to do with ATAC-seq.
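For what it's worth, reading the records as a group could look something like the sketch below. It assumes plain four-line FASTQ records with no wrapped sequence lines; `read_records` is just a name I made up for illustration, and the tiny in-memory record is invented test data, not from the real file.

```python
import io
from itertools import zip_longest


def read_records(handle):
    """Yield (header, sequence, separator, quality) tuples, four lines at a time."""
    lines = (line.rstrip("\n") for line in handle)
    # Passing the same iterator four times makes zip_longest consume it in chunks of four.
    for record in zip_longest(*[lines] * 4):
        if None in record:
            raise ValueError("input does not contain a multiple of four lines")
        yield record


# Made-up single record, only to show the shape of what comes back.
example = io.StringIO("@read1 2:N:0:AAAA+CCCC\nACGT\n+\nIIII\n")
records = list(read_records(example))
for header, sequence, separator, quality in records:
    print(header)
```

Once each record is a tuple, only the header element needs to be rewritten before the four lines are written back out.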


Hi Devon,

Thanks for your advice. I'm currently running a single-cell ATAC-seq analysis. After the bcl2fq step, the output FASTQ files carry the Tn5 and PCR barcodes in the header, as shown in my post. Hence, I'm looking for a way to extract the barcodes and concatenate them into the read name.


Here's an example Python script for finding and moving barcodes. You can adapt it to your needs (though first check whether umitools can already do what you want).


Thanks! I solved this problem!

6.0 years ago

One way to do this is to parse only the first of every four lines and leave the others untouched:

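#!/usr/bin/env python

import sys

# FASTQ records are four lines long; only the header (first line) is rewritten.
for i, line in enumerate(sys.stdin):
    line = line.rstrip()
    if i % 4 == 0:
        # Drop the leading '@' and split off the index field after the last ':'.
        fields = line[1:].split(':')
        head = ':'.join(fields[:-1])
        parts = fields[-1].split('+')
        left, right = parts[0], parts[1]
        # First and last 8 bases on each side of '+', then the original read name.
        barcode = left[:8] + left[-8:] + right[:8] + right[-8:]
        line = '@' + barcode + ':' + head
    sys.stdout.write(line + '\n')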

Example input:

$ cat input.fq
@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Example output:

$ ./rewrite_fq.py < input.fq
@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Or if you have compressed FASTQ, use a bash process substitution to feed uncompressed standard input to the script:

$ ./rewrite_fq.py < <(gunzip -c input.fq.gz)

To write the result to a compressed file:

$ ./rewrite_fq.py < <(gunzip -c input.fq.gz) | gzip -c > output.fq.gz

Etc.
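If you would rather not depend on the shell for compression, a rough alternative is to let Python's standard `gzip` module do the reading and writing. This is only a sketch under the same four-lines-per-record assumption; `rewrite_header` and `rewrite_gz` are names invented here, and the file names in the usage note are placeholders.

```python
import gzip


def rewrite_header(header):
    """Move the 8+8+8+8 index bases to the front of the read name."""
    # Everything after the last ':' is the index field, e.g. "AAAA...+CCCC...".
    name, _, index = header[1:].rpartition(":")
    left, _, right = index.partition("+")
    return "@" + left[:8] + left[-8:] + right[:8] + right[-8:] + ":" + name


def rewrite_gz(src_path, dst_path):
    """Stream a gzipped FASTQ, rewriting the first line of every four."""
    # "rt"/"wt" open the compressed files in text mode, so lines are str, not bytes.
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for i, line in enumerate(src):
            line = line.rstrip("\n")
            if i % 4 == 0:
                line = rewrite_header(line)
            dst.write(line + "\n")
```

Calling `rewrite_gz("input.fq.gz", "output.fq.gz")` should then give the same result as the pipeline above. For large files the shell pipeline may well be faster, since `gunzip` and `gzip` run as separate processes.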

6.0 years ago

sed solution. Output:

$ sed '/@/ s/\@\(.*\)\:\([A-Z]\{8\}\).*\([A-Z]\{8\}\)+\([A-Z]\{8\}\).*\([A-Z]\{8\}\)/\@\2\3\4\5\:\1/g' test.fastq 

@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Input:

$ cat test.fastq 

@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121
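To make the capture groups in the sed expression easier to follow, here is a rough Python `re` translation of the same pattern. It is a sketch, not a drop-in replacement: `HEADER` and `rewrite` are names chosen for illustration, and the function is meant to be called on header lines only, because quality lines can also contain "@" and the `/@/` address in sed does not tell the two apart.

```python
import re

# Group 1: read name up to the last ':' before the index field
# Group 2: first 8 bases of the first index
# Group 3: last 8 bases before the '+'
# Group 4: first 8 bases after the '+'
# Group 5: last 8 bases of the line
HEADER = re.compile(r"@(.*):([A-Z]{8}).*([A-Z]{8})\+([A-Z]{8}).*([A-Z]{8})")


def rewrite(header):
    """Return the header with groups 2-5 moved in front of the read name."""
    # Lines that do not match the pattern are returned unchanged.
    return HEADER.sub(r"@\2\3\4\5:\1", header)


line = (
    "@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:"
    "TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT"
    "+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA"
)
print(rewrite(line))
```

As with the sed version, the pattern needs at least 16 letters on each side of the "+", so shorter index fields would pass through untouched.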