Question

Remove duplicates in an extremely large text file

3

5.2 years ago

OAJn8634 ▴ 60

I have a very large two-column text file (9.2 GB) named All_SNPs.txt. I would like to use this file to convert the chr:pos IDs in my plink files to rs IDs. The file looks like this:

1:116342    rs1000277323
1:173516    rs1000447106
1:168592    rs1000479828
1:102498    rs1000493007

However, plink produced an error stating that All_SNPs.txt contains duplicates in column 1. I have tried different ways of removing the duplicates; specifically, I tried the following commands:

awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt

sort  All_SNPs.txt | uniq -u > All_SNPs_nodup.txt

cat All_SNPs.txt | sort | uniq -u > All_SNPs_nodup.txt

However, each time I ran into one of two problems: 1) the command makes no difference and I get the same file back (this happens with the command that uses cat), or 2) I get an error: the procedures halted: have exceeded (this happens with sort | uniq and with awk).

I will be very grateful for any ideas of how I can make this work. Thank you very much.

PS: this file is far too large to open in R.

awk uniq plink SNP • 6.9k views

ADD COMMENT • link updated 3.3 years ago by Biostar 20 • written 5.2 years ago by OAJn8634 ▴ 60

2

You could try to convert the input file to a valid VCF and then use bcftools sort and bcftools norm -N -d none to remove the duplicates. At the end you can convert back to the input format.

ADD REPLY • link 5.2 years ago by finswimmer 16k

0

Using datamash:

 datamash -sg 1 unique 2  <test.txt

datamash is available in brew, conda, apt repos.

Using tsv-utils:

tsv-uniq --ignore-case -H -f 1 test.txt

ADD REPLY • link 5.2 years ago by cpad0112 21k

0

Also try with a hashtable, like a python dict. No idea how much memory it would require but you can try if nothing else works.

ADD REPLY • link 5.2 years ago by GouthamAtla 12k

0

5.2 years ago

Benn 8.3k

Did you try:

awk '!seen[$1]++' All_SNPs.txt > All_SNPs_nodup.txt

If it doesn't work I think you need better hardware...

ADD COMMENT • link 5.2 years ago by Benn 8.3k

0

OP tried: awk '!x[$1]++ { print $1,$2 }' All_SNPs.txt > All_SNPs_nodup.txt @ b.nota

ADD REPLY • link 5.2 years ago by cpad0112 21k

0

Thank you for your suggestion. I have tried this command a few times but unfortunately I get an error: Cannot allocate memory

ADD REPLY • link 5.2 years ago by OAJn8634 ▴ 60

0

5.2 years ago

Santosh Anand 5.7k

PLINK has a basic mechanism for dealing with duplicates:

--list-duplicate-vars <require-same-ref> <ids-only> <suppress-first>

https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars

Then the duplicated vars can be excluded using --exclude plink.dupvar

ADD COMMENT • link 5.2 years ago by Santosh Anand 5.7k
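
A minimal sketch of how the two PLINK steps above might be chained, assuming PLINK 1.9 is on the PATH and the data are in binary (bed/bim/fam) format. The function name and the file prefixes are hypothetical. Note that this works on the PLINK dataset itself, not on All_SNPs.txt, and that --list-duplicate-vars groups variants by position and alleles rather than by ID.

```shell
# Sketch only: the function name and prefixes are illustrative assumptions.
# Usage: exclude_duplicate_vars mydata mydata_nodup
exclude_duplicate_vars() {
    local in_prefix=$1 out_prefix=$2
    # Step 1: write the IDs of duplicated variants to <in_prefix>.dups.dupvar.
    # ids-only makes the output usable with --exclude; suppress-first keeps
    # the first variant of each duplicate group out of the list.
    plink --bfile "$in_prefix" --list-duplicate-vars ids-only suppress-first \
          --out "$in_prefix.dups" || return 1
    # Step 2: write a new dataset without the listed variants.
    plink --bfile "$in_prefix" --exclude "$in_prefix.dups.dupvar" \
          --make-bed --out "$out_prefix"
}
```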

score 5 · Accepted Answer · 2019-02-05

Use cut and uniq to find the duplicates, then grep -v them away. Relatively quick on a 389 MB dummy file.

time sort -V All_SNPs.txt | cut -f 1 | uniq -c | perl -ane 'if($F[0] ne "1"){print "$F[1]\t$F[0]\n";}' > All_SNPs.dup.chr-pos.txt

real    0m57.428s
user    3m19.091s
sys     0m3.152s

time cut -f 1 All_SNPs.dup.chr-pos.txt | grep -wvf - All_SNPs.txt  > All_SNPs.nodup.txt

real    0m2.516s
user    0m1.697s
sys     0m0.229s
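
The same two steps can be sketched as a small shell function. This is an illustration rather than the commands timed above: it uses uniq -d instead of the perl filter, and swaps grep -w for an awk lookup so that matching is restricted to column 1. It assumes tab-separated columns, and the function name is made up. Like the commands above, it drops every copy of a duplicated key instead of keeping the first one.

```shell
# Sketch only: drop every line whose first column occurs more than once.
# Usage: drop_duplicated_keys All_SNPs.txt > All_SNPs.nodup.txt
drop_duplicated_keys() {
    local infile=$1 dups
    dups=$(mktemp)
    # Step 1: list the first-column values that occur more than once.
    cut -f 1 "$infile" | LC_ALL=C sort | uniq -d > "$dups"
    # Step 2: load only the duplicated keys into memory, then print the
    # lines of the input whose key is not among them.
    awk -F'\t' 'FILENAME == ARGV[1] { dup[$1]; next } !($1 in dup)' \
        "$dups" "$infile"
    rm -f "$dups"
}
```

Only the duplicated keys are held in the awk array, so memory use depends on the number of duplicates rather than on the size of the file.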

score 2 · Accepted Answer · 2019-02-05

2

5.2 years ago

Shred ★ 1.4k

Split the text file into smaller ones using the split command. You could split by size, for example:

split -b 200m filename

This will produce files named xaa, xab, xac, and so on. Now use awk, but with a simpler syntax:

awk -F"\t" '!seen[$1]++' xa*

After that, join the files into the destination file with a simple cat.

ADD COMMENT • link 5.2 years ago by Shred ★ 1.4k

2

How big is the chance that two dups end up in different files?

ADD REPLY • link 5.2 years ago by Benn 8.3k

2

You can use awk to split your file according to the chromosomes:

awk '{ split($1, a, ":"); print $1"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

That ensures that two duplicates never end up in different files. If you would like to reduce the file size a bit, you can remove the chr: prefix:

awk '{ split($1, a, ":"); print a[2]"\t"$2 >> a[1]".txt"; }' All_SNPs.txt

This will also allow you to keep a few more items in the hash.

ADD REPLY • link 5.2 years ago by michael.ante ★ 3.8k
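
A minimal sketch of the per-chromosome idea combined with the awk de-duplication from earlier in the thread. The function name and the temporary directory are illustrative assumptions. Each chunk is de-duplicated on its own, so the awk hash only ever holds the keys of one chromosome at a time.

```shell
# Sketch only: split by chromosome, then de-duplicate each chunk separately.
# Usage: dedup_per_chromosome All_SNPs.txt > All_SNPs_nodup.txt
dedup_per_chromosome() {
    local infile=$1 workdir chunk
    workdir=$(mktemp -d)
    # Write each line to a chunk named after the text before the ':'.
    awk -v dir="$workdir" '{
        split($1, a, ":")
        out = dir "/" a[1] ".txt"
        print > out
    }' "$infile"
    # Keep the first line seen for each key, one chromosome at a time.
    for chunk in "$workdir"/*.txt; do
        awk '!seen[$1]++' "$chunk"
    done
    rm -rf "$workdir"
}
```

The output is grouped by chunk file name, so it will not be in the original line order.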

0

Shit happens. But if a file is too large for RAM, there's no way to map it into memory in bash. A Python solution might work, as explained here

ADD REPLY • link 5.2 years ago by Shred ★ 1.4k

0

Yeah, the link is worth a try, or here is another awk solution. Can't test it myself, though.

ADD REPLY • link 5.2 years ago by Benn 8.3k

0

If you sort first, I guess the chance is very small, no?

ADD REPLY • link 5.2 years ago by Gautier Richard ▴ 370
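
A minimal sketch of the sort-first idea: once the lines are ordered by column 1, duplicates sit next to each other, so a streaming awk filter only has to remember the previous key. This assumes GNU sort, which to my knowledge spills to temporary files on disk when the input exceeds its buffer (the buffer size and temp directory can be set with -S and -T). The function name is made up for illustration.

```shell
# Sketch only: keep the first line for each column-1 key after sorting.
# Usage: dedup_sorted All_SNPs.txt > All_SNPs_nodup.txt
dedup_sorted() {
    # -k1,1 sorts on the first column only; -s keeps lines with equal keys
    # in their input order, so the first occurrence is the one that survives.
    LC_ALL=C sort -s -k1,1 "$1" |
        awk 'NR == 1 || $1 != prev { print } { prev = $1 }'
}
```

The output comes back in byte-wise key order (LC_ALL=C) instead of the original line order.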

0

Sort loads file into memory too.

ADD REPLY • link 5.2 years ago by Shred ★ 1.4k

0

Do you actually have this as a PLINK dataset? Why not try to use PLINK functionality to update the map file? For example, --list-duplicate-vars lists duplicates, which can then be excluded