grep unexpected behaviour
3
2
Entering edit mode
9.1 years ago
Illinu ▴ 110

I have a list of sequence names:

c32026_g2_i1
c43297_g1_i9
c45863_g2_i2
c43297_g1_i10
c35765_g2_i1
c44444_g3_i1
...

I want to take each sequence from the list and output the annotation of that sequence extracting it from the annotations file

This is what I do and it doesn't work well, for example in the list there is:

c38478_g1_i5
c38478_g1_i4
c38478_g1_i17
c38478_g1_i9
c38478_g1_i18
c38478_g1_i1

and it outputs:

c38478_g1_i5    1018    Q3B724    ...
c38478_g1_i4    1000    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i9    1007    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i1    496    -    -    -    - ...
c38478_g1_i10    950    -    -    -   ...
c38478_g1_i11    249    Q3B724    ...
c38478_g1_i12    706    -    -    -    ...
c38478_g1_i13    654    -    -    -    ...
c38478_g1_i14    809    -    -    -  ...
c38478_g1_i15    217    -    -    -  ...
c38478_g1_i16    788    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i19    548    -    -    -    ...
for f in $(cat list.OE.txt); do grep $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt
grep • 2.5k views
ADD COMMENT
3
Entering edit mode

What are the expected input and output? If you are just greping the list from a file, and your list are store in a file, let's say, list.txt, then you can always do

grep -wf list.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt
ADD REPLY
1
Entering edit mode

You need "-wFf". When list.txt is huge, "-F" will be much faster.

ADD REPLY
0
Entering edit mode

Yes, that worked :) Thank you

ADD REPLY
3
Entering edit mode
9.1 years ago
5heikki 11k

This is expected behavior since you didn't use the -w flag. man grep

For example,

grep c38478_g1_i1

Would return such lines:

c38478_g1_i11
c38478_g1_i1111111
c38478_g1_i122343454
as_long_as_c38478_g1_i1_is_on_the_line
ADD COMMENT
0
Entering edit mode

Great, I see! thank you

ADD REPLY
2
Entering edit mode
9.1 years ago
tszn1984 ▴ 100

Avoid grep in a loop. This is O(MxN) time complexity.

A more general solution: use hash to store the ids, then check if the data file has that id in a specific column.

awk 'BEGIN{FS=OFS="\t";while(getline<"ids.lst") ids[$1] =1}{if(ids[$1]==1) print}' full_anno.txt
ADD COMMENT
0
Entering edit mode

A more general description for such case:

  • Data file: a matrix file, with ID in the Nth column
  • ID list file: a list of IDs
>selectIn datafile  IDfile N

selectIn source file

###################
#!/bin/sh
#Last-modified: 13 Apr 2013 03:02:05 PM
USAGE=" Usage: $0 Data.txt id.lst [col=1]"
case $# in
    0) echo $USAGE
       exit;;
    1) echo $USAGE
       exit;;
    2) col=1;;
    *) col=$3;;
esac
data=$1
awk -vLST=$2 -vCOL=$col 'BEGIN{FS=OFS="\t";while(getline<LST) count[$1]=1}{if(count[$(COL)]==1) print}' $data
###################
ADD REPLY
0
Entering edit mode
9.1 years ago
Illinu ▴ 110

Just to update, this form:

for f in $(cat list.OE.txt); do grep -w $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt

works A LOT faster than this one:

grep -wf list.OE.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt
ADD COMMENT
0
Entering edit mode

If you have large files it will be extremely slow anyway since you're searching the entire file again and again. For such tasks, there's man join

ADD REPLY

Login before adding your answer.

Traffic: 1468 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6