Delete entire row with matched barcode string
2
1
5.9 years ago

I have a blacklist barcode file and a SAM file. I want to delete an entire row of the SAM file if its first 32 characters match a barcode in the blacklist. I've tried "grep -vf blackList_barcode.txt hg19.sam", but it's too slow. Is there any good suggestion, in any language? (In the following example, the first row of the SAM file will be deleted, because its first 32 characters match the third barcode in the blacklist.)

blackList_barcode.txt 

AAAAAAAAAAAAAAAACGTCTAATCAGGACGT

AAAAAAAAAAAAAAAACTAAGCCTTAATCTTA

AAAAAAAAAAACAAAATCCTACGGTATAGCCT

AAAAAAAAAAGATTATTCTTAAAACACAACCA

hg19.sam

AAAAAAAAAAACAAAATCCTACGGTATAGCCT:M01581:1209:000000000-D3YJT:1:1102:16735:1704 1:N:0    163     chr17   21903973        3       15=1X28=        =       21904354        425     CAGTGCTTCCCACGGCTGTCTTAGGAACCAGTCCCCGAGGCTTG    1>>>>F3B31B1BAA1AA0BFBA311DE000A11EFEEG///F/    XT:A:R  NM:i:1  AM:i:3

CGGCTATGGGACTCCTCGTCTAATGTACTGAC:M01581:1209:000000000-D3YJT:1:1102:14725:1704 1:N:0    73      chr1    71429475        42      44=     =       71429475        0       CCTGGTTACTTATGCAAATTTCTGATGCAGGCTTGAATTTCTCC    AA1A1@11@D3DAABGFGGC1EHHHF1FGGCEGHGCEGBFGFAE    NM:i:0  AM:i:0
ADD COMMENT
2
5.9 years ago

Hello,

Is it faster if you only grep for the line start?

grep -v "^AAAAAAAAAAAAAAAACGTCTAATCAGGACGT" hg19.sam

Also using parallel might be a good idea:

parallel -j 8 -k 'samtools view hg19.sam chr{}|grep -vf blackList_barcode.txt' ::: {1..22} X Y

I don't know how fast samtools view is on sam files. Maybe you should first convert it to bam, sort it, and index it, so that random access to regions can be used.

fin swimmer

ADD COMMENT
0

How could I use the entire blacklist file if I only grep for the line start? Thanks!

ADD REPLY
0

I don't know whether grep has an option for this, but you can modify your blacklist file to prepend a ^ to each line:

awk '{print "^"$0}' blackList_barcode.txt > blackList_new.txt
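Combined with plain grep -v -f, the anchored patterns then only suppress lines whose barcode sits at the very start. Note that ^ is a regex feature, so this anchored file must not be used together with grep -F (which treats ^ as a literal character). A small sketch with shortened, made-up barcodes:

```shell
# Build an anchored pattern file from a toy blacklist
printf 'AAAACGTC\n' > blackList_barcode.txt
awk '{print "^"$0}' blackList_barcode.txt > blackList_new.txt

# Toy SAM-like file: read2 contains the barcode, but not at the line start
printf 'AAAACGTC:read1\nTTTTAAAACGTC:read2\n' > hg19.sam

# Only read1 is removed; the anchored pattern ignores mid-line matches
grep -v -f blackList_new.txt hg19.sam
```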
ADD REPLY
2
5.9 years ago

LC_ALL=C makes grep a little bit faster. Also try a parallel pipe along the lines of 'parallel -k --pipe --block 10M --ungroup LC_ALL=C grep -v -f query.txt target.txt'. One can replace --block with the lines option -N. Also use -F in grep so the patterns are matched as fixed strings, i.e. something like grep -v -F -f query.txt.
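On a toy example (shortened, made-up barcodes), the fixed-string variant looks like this:

```shell
# Toy blacklist and SAM-like file (made-up, shortened barcodes)
printf 'AAAACGTC\nGGGGTTTT\n' > blackList_barcode.txt
printf 'AAAACGTC:read1\nCCCCAAAA:read2\n' > hg19.sam

# -F: treat patterns as fixed strings (no regex engine);
# -v: keep only non-matching lines;
# LC_ALL=C: byte-wise comparison instead of locale-aware collation
LC_ALL=C grep -v -F -f blackList_barcode.txt hg19.sam
```

Only read2 survives, since read1's barcode is on the blacklist.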

ADD COMMENT
0

Why do you use the --ungroup argument? Thanks!

ADD REPLY
0

GNU parallel doesn't show a job's output until the job completes. To get output immediately, the option --ungroup helps. Sometimes --ungroup is a problem if files are big; --line-buffer, or -k by itself, should be sufficient.

ADD REPLY
0

Thanks! I have multiple CPU cores. Is there any way to utilize them to enhance the speed?

ADD REPLY
0

Without any explicit setting, parallel starts as many jobs in parallel as there are CPU cores. You can set the number explicitly using -j.

ADD REPLY
0

parallel -k --pipe -N 1M --ungroup LC_ALL=C grep -v -f -F black_barcode.txt out_hg19.sam > result.sam

I used the above command line with 4 CPUs, but the result.sam file just shows the usage of parallel:

Usage:

parallel [options] [command [arguments]] < list_of_arguments
parallel [options] [command [arguments]] (::: arguments|:::: argfile(s))...
cat ... | parallel --pipe [options] [command [arguments]]

-j n             Run n jobs in parallel
-k               Keep same order
-X               Multiple arguments with context replace
--colsep regexp  Split input on regexp for positional replacements
{} {.} {/} {/.} {#} {%} {= perl code =}   Replacement strings
{3} {3.} {3/} {3/.} {=3 perl code =}      Positional replacement strings
With --plus: {} = {+/}/{/} = {.}.{+.} = {+/}/{/.}.{+.} = {..}.{+..} = {+/}/{/..}.{+..} = {...}.{+...} = {+/}/{/...}.{+...}

-S sshlogin      Example: foo@server.example.com
--slf ..         Use ~/.parallel/sshloginfile as the list of sshlogins
--trc {}.bar     Shorthand for --transfer --return {}.bar --cleanup
--onall          Run the given command with argument on all sshlogins
--nonall         Run the given command with no arguments on all sshlogins

--pipe           Split stdin (standard input) to multiple jobs.
--recend str     Record end separator for --pipe.
--recstart str   Record start separator for --pipe.

Besides, the blacklist file has 53000 lines and out_hg19.sam has 1192807 lines. What number should I set for the -N argument?

ADD REPLY
1

Your script has a problem. One uses 1M (or any size) when the --block option is used. When -N is used, a number of lines is provided. Use either one, not a hybrid.

The script should be something like this using --block (1M, 2M, or nM):

cat out_hg19.sam | parallel -k --pipe --block 1M 'LC_ALL=C grep -v -F -f black_barcode.txt' > result.sam

or, when you use lines (-N) (e.g. here 100000; it could be any number of the user's choice, up to the number of lines in the file):

cat out_hg19.sam | parallel -k --pipe -N100000 'LC_ALL=C grep -v -F -f black_barcode.txt' > result.sam

In addition, I see no option for CPUs or cores (-j) in your code, contrary to what is written (that 4 CPUs are used). Also, to override cores with CPUs, parallel needs to be given a separate option.

ADD REPLY
0

Thanks! What's the meaning of -N100000? With 53000 lines in my blacklist and 1192807 lines in my sam file, is 100000 a proper value? Thanks a lot!

ADD REPLY
0

Go with 50,000 first and check whether things work out. N is a number of lines: the number following -N in -N100000 is the number of lines to be processed at a time.

ADD REPLY
0

For using 4 CPUs, I should add the argument -j 4, right?

ADD REPLY
1

-j 4 uses 4 cores, not CPUs. Do not confuse CPUs with cores. If you want to use CPUs instead of cores, you should supply "--use-cpus-instead-of-cores" (IMO). Could you please post the output of:

  parallel --number-of-cpus
  parallel --number-of-cores

If you really have 4 CPUs and each CPU has multiple cores, you can tell the program not to count cores and to use CPUs instead.

ADD REPLY
1

Hello,

try this:

cat out_hg19.sam | parallel -k --pipe --block 1M --ungroup 'LC_ALL=C grep -v -F -f black_barcode.txt -' > result.sam

fin swimmer

ADD REPLY
0

I successfully used the command to delete the rows with blacklisted barcodes.

cat out_mm10.sam | parallel -j 4 -k --pipe -N100000 --ungroup 'LC_ALL=C grep -F -v -f black_barcode.txt' > mm10_afterFilter.sam

However, when I use

samtools view -S -b mm10_afterFilter.sam > mm10_afterFilter.bam

It shows:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 133635
[main_samview] truncated file.

Here is line 133635. I find that pieces of different lines are concatenated together. The original file doesn't have this issue, so I believe the problem is generated by the parallel and grep command. Any suggestions? Thanks!

ATTACTCGTAGGCATGCGTCTAATATAGAGGC:M01581:1209:000000000-D3YJT:1:1102:14469:9952 1:N:0 83 chr20 37591922 42 44= =
37591826 -140 TA
CCACCTGCACCCCTGCAGGTCTGGATGTGCAAAGTATTCAGAAGTAGAGGATATCCTCTCAGGACGT:M01581:1209:000000000-D3YJT:1:1102:17221:11810 1:N:0 147 chrUn_gl000220 115778 2 5=1X
4=1X2=2X18=1X10= = 115771 -51
ATGGGACAGTGGGGGCTGAGAGATGGGCGAGCGGCGTTCCGAAG
1D1BB21B///A/0121B211B000A0A0?AA>1AA1>11>1>1 XT:A:R NM:i:5 AM:i:2

ADD REPLY
0

[...] so I believe the problem is generated from the parallel and grep command.

I believe the problem is --ungroup. From the man page:

--ungroup allows output to mix up, with half a line coming from one job and half a line coming from another job.

fin swimmer

ADD REPLY
0

Should I just delete the --ungroup argument?

ADD REPLY
0

Yes, you should do this.

ADD REPLY
