GREP: out of memory
5.8 years ago
jaqx008 ▴ 110

Hello all, I have some .gff files that I am running grep on. For some of the files grep runs fine, but for others it exits with the error message "grep: out of memory". It seems to be a common problem with grep, but I haven't really seen a solution that helps, either on this forum or elsewhere. Besides, I don't know much about memory management, but my machine has 128 GB of RAM and I don't see why I am getting this error.

Thanks in advance.

(My command is below and it runs fine for some of the files)

grep -f 1.txt 2.gff > 3.gff
Deseq2 macOS gff HEloci

grep -f can indeed consume a ton of memory. It's actually nice that you received the "out of memory" error; I have crashed a 500 GB RAM server with a nasty grep -f.


jaqx008: Are you grep'ing something super secret? Can you provide an actual example so we can avoid this endless back and forth? As @ATPoint said, there may be a more efficient way of doing whatever you are trying to do without using grep.


This is what I am trying to do: there are gene IDs in a text file A, and lines possibly containing those gene IDs in a gff file B. I am trying to identify the matches in B and output every matching line of B in full (all columns). A has only one column of gene IDs; B has multiple columns, one of which contains the gene IDs. So my command should pull out the lines of B that correspond to the IDs in A. Below is the command:

$ grep -f A.txt B.gff > AinB.gff

Is this clear enough? And is there another way to go about this without grep? Thanks.

Can you give us an idea of the numbers we're talking about here? File sizes? List length?

In most cases a solution like the one offered by arup (i.e. loop through your list and grep for each entry) will solve your issue.


OK. The text files range from about 100 bytes to about 50 KB, with a word count of around 1500, in this format:

2597857964
2597857966
2597857965
2597857963
2597860200
2597857153
2597857472

while the .gff files range from about 600 KB to about 900 KB, with a wc of about 5000 to 7000, in the format below:

##gff-version 3

Ga0036900_gi400319889.1 img_core_v400   CDS 109 1620    .   +   0   ID=2597849326;locus_tag=Ga0036900_00001;product=chromosomal replication initiator protein DnaA
Ga0036900_gi400319889.1 img_core_v400   CDS 1660    2763    .   +   0   ID=2597849327;locus_tag=Ga0036900_00002;product=DNA polymerase III, beta subunit
Ga0036900_gi400319889.1 img_core_v400   CDS 2784    3887    .   +   0   ID=2597849328;locus_tag=Ga0036900_00003;product=DNA replication and repair protein RecF
Ga0036900_gi400319889.1 img_core_v400   CDS 3893    6310    .   +   0   ID=2597849329;locus_tag=Ga0036900_00004;product=DNA gyrase subunit B

Also, I did try the loop, but it exits with the error:

grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence

Where did you get your files from, and on which operating system? Did you open them in Windows or so? I had to look it up myself, but apparently this 'error' is related to the encoding of your file.

If you have dos2unix (or mac2unix) installed, you might run that on your files first to convert them to proper Unix encoding.
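
A minimal sketch of that, assuming dos2unix is installed and reusing the A.txt / B.gff names from the comment above (dos2unix converts files in place):

dos2unix A.txt    # convert the pattern list to Unix line endings
dos2unix B.gff    # convert the gff file as well

On macOS, another common workaround for BSD grep's "illegal byte sequence" message is to force the C locale just for the grep command by prefixing it with LC_ALL=C.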


What is the output of ulimit -a?


Oh, I see. Well, the ulimit -a output is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

Your account does not have a memory limit set. So the error is due to the possibilities enumerated by others.


What are you grep-ing for? A more efficient way might be to use tabix, depending on what you want to retrieve.


Well, I can't post more than 5 times in 6 hours, and that's why I haven't responded. BTW, I am grep-ing a text file like

2345
3245
5432
1234

against a .gff file like

4634 - ID=2345
4353 + ID=3245

etc. It's working for some and giving the memory complaint for others.


Those patterns are too generic and likely generate many matches. That must be the reason why you are running out of memory. Are you trying to pull out specific genes?

Once you have posted a certain number of times (and gained rep points), that posting limit should go up.


Yes, I am trying to pull out certain gene IDs with their corresponding information. I posted what my files look like in the comment above. The thing is, it works fine for some and not for others.


This section of the GNU parallel manual may help you with automatically chunking the pattern file or the input file: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep. The manual covers each situation (e.g. limited RAM, limited CPU, etc.).
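
For reference, a hedged sketch of what that could look like here, assuming GNU parallel is installed and reusing the 1.txt / 2.gff names from the question (the 10M block size is arbitrary):

# split 2.gff into ~10 MB blocks, grep each block against the fixed-string
# pattern list in parallel, and keep the output in input order
parallel --pipepart -a 2.gff --block 10M -k grep -wFf 1.txt > 3.gff

Note that each job still loads the full pattern list, so this mainly helps with speed; the manual section linked above also covers the case where the pattern list itself is too big for RAM.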

5.8 years ago
jaqx008 ▴ 110

I was able to resolve this issue. If anyone else runs into the same problem, the commands below should help. First, split the pattern file into chunks of n lines using

split -l n file.txt

where n = the number of lines desired per chunk (I used n = 5). Then grep each chunk against the gff file and concatenate the results:

$ for i in x*; do grep -f "$i" file2.gff > "$i".tmp.gff; done && cat *tmp.gff > output.gff
5.8 years ago

Rather than running them all at once, pass the 1.txt entries one by one:

while read line; do grep -wF "$line" 2.gff; done < 1.txt

Source: https://askubuntu.com/questions/595426/use-a-list-of-words-to-grep-in-an-other-list
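
If you want the combined matches in a single file, as in the original command, the whole loop can be redirected:

while read line; do grep -wF "$line" 2.gff; done < 1.txt > 3.gff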


I would personally add -m 1 to this, so that each grep does not unnecessarily keep reading through the file after it has found a match:

-m, --max-count=NUM stop after NUM matches


Hey, where in the command would you put -m? I mean, where in

while read line; do grep -wF "$line" 2.gff; done < 1.txt

before the -wF part.

I must add that I suggested this thinking you only needed one matching line per grep, but that is likely not the case, so using the -m option might be a bit tricky.
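
For the record, the placement suggested above would look like this (with the caveat just mentioned that -m 1 makes each grep stop after its first matching line):

while read line; do grep -m 1 -wF "$line" 2.gff; done < 1.txt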


It gives a long list that looks like this and then terminates after I run the command:

grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: output.txt: No such file or directory
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence

Are you grepping a pattern with spaces in it?

5.8 years ago

I'm no sysadmin myself, but I believe there is (or can be) a limit on the amount of memory a single command is allowed to use, regardless of the total amount of memory in the server (this is to avoid one command killing the server).

I see you're doing a grep -f, so what you can do is split up your 1.txt file into several parts, run them one after the other, and append the results to the 3.gff file.
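
A minimal sketch of that approach, reusing the 1.txt / 2.gff / 3.gff names from the question (the chunk size of 500 lines is arbitrary):

split -l 500 1.txt     # produces chunks named xaa, xab, ...
> 3.gff                # start with an empty output file
for chunk in x??; do
    grep -f "$chunk" 2.gff >> 3.gff   # append each chunk's matches
done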


Sounds like a good plan. I will try this now.


I did this and it already worked for one of the txt files, but another gave this error. Does it mean there was no match?

grep: empty (sub)expression

No, I'm guessing you might have empty/blank lines in your 1.txt file? Or lines with a '|' in them?

5.8 years ago

To have grep do exact-word matches from a file of strings, use -w -F:

$ grep -w -F -f 1.txt 2.gff > 3.gff

The -w option restricts regular-expression matches to whole words. The -F option additionally treats each pattern as a fixed string rather than a regular expression.

Using -f on its own will consume a lot of memory and will match words from 1.txt that occur as substrings within 2.gff.

In other words, if you have a string like 12345 in 1.txt, then -f on its own will match the larger set of strings that contain 12345: 12345 itself (what you want), but also 123456, 1234567, 012345, etc., wherever they are found in 2.gff.

The match on 12345 itself is desired, but all the other matches that merely contain 12345, like 123456 and 1234567, are probably not what you want.

By combining -w and -F, you get an exact match for each string you provide. So 12345 will only produce a hit on the word 12345, and not on any other match where 12345 is a prefix, suffix, or substring of something else.

As a bonus, using -F makes grep consume a great deal less memory: regular expressions use lots of memory, but fixed-string matching does not need to.
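
As a tiny, made-up illustration of the difference (the IDs here are hypothetical):

printf 'ID=12345;a\nID=123456;b\n' > demo.gff
echo 12345 > patterns.txt
grep -f patterns.txt demo.gff          # substring match: prints both lines
grep -w -F -f patterns.txt demo.gff    # whole-word, fixed-string: prints only ID=12345;a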
