Biostar Beta. Not for public use.
grep unexpected behaviour
2
Entering edit mode
19 months ago
Illinu • 90
Belgium

I have a list of sequence names:

c32026_g2_i1
c43297_g1_i9
c45863_g2_i2
c43297_g1_i10
c35765_g2_i1
c44444_g3_i1
...

I want to take each sequence from the list and output the annotation of that sequence extracting it from the annotations file

This is what I do and it doesn't work well, for example in the list there is:

c38478_g1_i5
c38478_g1_i4
c38478_g1_i17
c38478_g1_i9
c38478_g1_i18
c38478_g1_i1

and it outputs:

c38478_g1_i5    1018    Q3B724    ...
c38478_g1_i4    1000    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i9    1007    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i1    496    -    -    -    - ...
c38478_g1_i10    950    -    -    -   ...
c38478_g1_i11    249    Q3B724    ...
c38478_g1_i12    706    -    -    -    ...
c38478_g1_i13    654    -    -    -    ...
c38478_g1_i14    809    -    -    -  ...
c38478_g1_i15    217    -    -    -  ...
c38478_g1_i16    788    Q3B724    ...
c38478_g1_i17    887    Q3B724    ...
c38478_g1_i18    738    Q3B724    ...
c38478_g1_i19    548    -    -    -    ...




for f in $(cat list.OE.txt); do grep $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt
grep • 1.1k views
ADD COMMENTlink
3
Entering edit mode

What are the expected input and output? If you are just greping the list from a file, and your list are store in a file, let's say, list.txt, then you can always do

grep -wf list.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt
ADD REPLYlink
1
Entering edit mode

You need "-wFf". When list.txt is huge, "-F" will be much faster.

ADD REPLYlink
0
Entering edit mode

Yes, that worked :) Thank you

ADD REPLYlink
3
Entering edit mode
17 months ago
5heikki 8.4k
Finland

This is expected behavior since you didn't use the -w flag. man grep

For example,

grep c38478_g1_i1

Would return such lines:

c38478_g1_i11
c38478_g1_i1111111
c38478_g1_i122343454
as_long_as_c38478_g1_i1_is_on_the_line
ADD COMMENTlink
0
Entering edit mode

Great, I see! thank you

ADD REPLYlink
2
Entering edit mode
3.7 years ago
tszn1984 • 90
United States

Avoid grep in a loop. This is O(MxN) time complexity.

A more general solution: use hash to store the ids, then check if the data file has that id in a specific column.

awk 'BEGIN{FS=OFS="\t";while(getline<"ids.lst") ids[$1] =1}{if(ids[$1]==1) print}' full_anno.txt

ADD COMMENTlink
0
Entering edit mode

A more general description for such case:
Data file: a matrix file, with ID in the Nth column
ID list file: a list of IDs

selectIn datafile IDfile N

selectIn source file

#

!/bin/sh

Last-modified: 13 Apr 2013 03:02:05 PM

USAGE=" Usage: $0 Data.txt id.lst [col=1]"
case $# in
0) echo $USAGE
exit;;
1) echo $USAGE
exit;;
2) col=1;;
*) col=$3;;
esac
data=$1
awk -vLST=$2 -vCOL=$col 'BEGIN{FS=OFS="\t";while(getline<LST) count[$1]=1}{if(count[$(COL)]==1) print}' $data

#
ADD REPLYlink
0
Entering edit mode
19 months ago
Illinu • 90
Belgium

Just to update, this form:

for f in $(cat list.OE.txt); do grep -w $f Trinity_uniref_2015_02_filt_ann_out.txt; done > OE.annocript.txt

works A LOT faster than this one:

grep -wf list.OE.txt Trinity_uniref_2015_02_filt_ann_out.txt > OE.annocript.txt
ADD COMMENTlink
0
Entering edit mode

If you have large files it will be extremely slow anyway since you're searching the entire file again and again. For such tasks, there's man join

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1