Biostar Beta. Not for public use.
Question: Extract, combine substring and concatenate in the beginning of the same row

I have a file with sequences like the following. From each header line, I want to extract the first 8 letters after "1:N:0:" ("2:N:0:" in the example below), the 8 letters before the "+" sign, the 8 letters after the "+" sign, and the last 8 letters of the line. I then need to concatenate these 8+8+8+8 = 32 letters and insert them right after the "@" in the same line, followed by ":". Can anyone suggest how to do this? I plan to use bash, Python, or R. Thanks!

Original:

@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Results:

@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Asked by niu.shengyong, 22 months ago; updated 22 months ago by Alex Reynolds

Python would be the most appropriate of the three tools you listed. Read in 4 lines at a time (one FASTQ record) and process them as a group.
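
The grouping idea above could be sketched roughly as follows. This is only an illustration, assuming every record is exactly four lines and every header ends in two barcode strings joined by "+"; the names `read_records` and `move_barcodes` are made up for this example, and the record is the one from the question.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as lists of four lines with newlines stripped."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def move_barcodes(header, width=8):
    """Move the first/last `width` bases of each barcode string to just after '@'."""
    # Everything before the last ':' is the read name; the rest is 'LEFT+RIGHT'.
    name, _, index = header[1:].rpartition(":")
    left, right = index.split("+")
    barcode = left[:width] + left[-width:] + right[:width] + right[-width:]
    return "@" + barcode + ":" + name


# The single record from the question, used here as in-memory input.
example = io.StringIO(
    "@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:"
    "TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+"
    "TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA\n"
    "TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC\n"
    "+\n"
    "1111113331311100A0////////////////0111221121\n"
)

rewritten = []
for header, sequence, separator, quality in read_records(example):
    rewritten.append([move_barcodes(header), sequence, separator, quality])

for record in rewritten:
    print("\n".join(record))
```

Reading whole records rather than single lines means a truncated file is reported instead of silently producing shifted output.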

I have no idea why you're using such malformed fastq files or what this has to do with ATAC-seq.

Reply by Devon Ryan, 22 months ago

Hi Devon,

Thanks for your advice. I'm running a single-cell ATAC-seq analysis, and bcl2fastq outputs FASTQ files with the Tn5 and PCR barcodes in the header, as in my post. So I'm looking for a way to extract the barcodes and concatenate them into the read name.

Reply by niu.shengyong, 22 months ago

Here's an example Python script for finding and moving barcodes. You can adapt it to your needs (though check first whether UMI-tools can already do what you want).

Reply by Devon Ryan, 22 months ago

Thanks! I solved the problem!

Reply by niu.shengyong, 22 months ago

One way to do this is to parse just the first of every four lines and leave the others untouched:

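#!/usr/bin/env python

import sys

# A FASTQ record is four lines; only the first (the header) is rewritten.
for line_number, line in enumerate(sys.stdin):
    line = line.rstrip()
    if line_number % 4 == 0:
        # Drop the leading '@' and split the header on ':'.
        fields = line[1:].split(':')
        # Everything before the last ':' is kept as the read name.
        name = ':'.join(fields[:-1])
        # The last field holds the two barcode strings joined by '+'.
        halves = fields[-1].split('+')
        # First and last 8 bases of each barcode string, 32 bases in total.
        barcode = halves[0][:8] + halves[0][-8:] + halves[1][:8] + halves[1][-8:]
        line = '@' + barcode + ':' + name
    sys.stdout.write('%s\n' % line)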

Example input:

$ cat input.fq
@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Example output:

$ ./rewrite_fq.py < input.fq
@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

Or if you have compressed FASTQ, use a bash process substitution to feed uncompressed standard input to the script:

$ ./rewrite_fq.py < <(gunzip -c input.fq.gz)

To write the result to a compressed file:

$ ./rewrite_fq.py < <(gunzip -c input.fq.gz) | gzip -c > output.fq.gz

Etc.
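
If you would rather keep the decompression inside Python, the standard-library gzip module can read and write compressed text directly. The following is only a sketch under the same assumptions as the script above (four lines per record, barcodes after the last ":" joined by "+"); the function names `rewrite_header` and `rewrite_gzip` and the temporary file names are hypothetical and exist only for this demonstration.

```python
import gzip
import os
import tempfile


def rewrite_header(header, width=8):
    """Return the header with the 4 x `width` barcode bases moved after '@'."""
    name, _, index = header[1:].rpartition(":")
    left, right = index.split("+")
    barcode = left[:width] + left[-width:] + right[:width] + right[-width:]
    return "@" + barcode + ":" + name


def rewrite_gzip(in_path, out_path):
    """Stream a gzipped FASTQ file, rewriting the first line of every four."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for number, line in enumerate(src):
            line = line.rstrip("\n")
            if number % 4 == 0:
                line = rewrite_header(line)
            dst.write(line + "\n")


# Demonstration on a throwaway copy of the record from the question.
record = (
    "@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:"
    "TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+"
    "TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA\n"
    "TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC\n"
    "+\n"
    "1111113331311100A0////////////////0111221121\n"
)

with tempfile.TemporaryDirectory() as workdir:
    in_path = os.path.join(workdir, "input.fq.gz")
    out_path = os.path.join(workdir, "output.fq.gz")
    with gzip.open(in_path, "wt") as handle:
        handle.write(record)
    rewrite_gzip(in_path, out_path)
    with gzip.open(out_path, "rt") as handle:
        result = handle.read().splitlines()

print("\n".join(result))
```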

Answer by Alex Reynolds, 22 months ago

sed solution, with output:

$ sed '/@/ s/\@\(.*\)\:\([A-Z]\{8\}\).*\([A-Z]\{8\}\)+\([A-Z]\{8\}\).*\([A-Z]\{8\}\)/\@\2\3\4\5\:\1/g' test.fastq 

@TCTCGCGCTTACTCCTTTAAGGAGTGCTCTGA:M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121

input:

$ cat test.fastq 

@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA
TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC
+
1111113331311100A0////////////////0111221121
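
For comparison, roughly the same substitution can be written with Python's re module. This is a sketch rather than an exact translation of the sed command: the `^` and `$` anchors and the use of `[A-Z]*` in place of `.*` are tightening assumptions added here, and the names `HEADER` and `rewrite` are hypothetical.

```python
import re

# Groups: 1 = read name, 2/3 = first/last 8 bases before '+',
# 4/5 = first/last 8 bases after '+'.
HEADER = re.compile(
    r"^@(.*):([A-Z]{8})[A-Z]*([A-Z]{8})\+([A-Z]{8})[A-Z]*([A-Z]{8})$"
)


def rewrite(line):
    """Rewrite a matching header line; return any other line unchanged."""
    return HEADER.sub(r"@\2\3\4\5:\1", line)


lines = [
    "@M01581:1209:000000000-D3YJT:1:1102:16163:1449 2:N:0:"
    "TCTCGCGCGGACAGGGACAGCCGCGCCCACGCTACTTACTCCT+"
    "TTAAGGAGTCGTCGGCAGCGTCTCCACGCTGCTCTGA",
    "TTTTCTTCTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTCTTTCC",
    "+",
    "1111113331311100A0////////////////0111221121",
]

converted = [rewrite(line) for line in lines]
print("\n".join(converted))
```

Because the pattern must match the whole line, sequence and quality lines pass through unchanged unless they happen to look exactly like a header with barcodes.
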

Answer by cpad0112, 22 months ago
