Comparing two lists using command line tools
7.4 years ago
Farbod ★ 3.4k

Dear Friends, Hi (Sorry if this question is simple or duplicated)

I have two lists of IDs, A.txt and B.txt, where B is larger than A (A = the IDs from the BLAST results, B = the IDs from the original EST database that BLAST was run against).

I want to compare them and collect the IDs that are present in list B (the bigger list) but absent from list A (the ESTs for which BLAST could not find any hit).

Please help me do this on the Linux command line (no Perl or Python, please. Thanks).

NOTE1: I usually do it as below, but I need a more straightforward approach:

1- sort both lists
2- $ comm A.txt B.txt | cut -f2 > specific-to-B.txt
3- $ sed -i '/^\s*$/d' specific-to-B.txt (because this file contains many blank lines)
4- the list is ready

NOTE2: An example of a list:

EV825482.1
EV825573.1
EV825616.1
EV825623.1
EV825663.1
EV825667.1
EV825673.1
EV825677.1
EV825680.1

sequence blast BASH • 6.0k views

I want to compare them and collect the IDs that are present in the list B (bigger list) but absent in the list A (the ESTs that blast can not find any hit for them).

From the comm manual:

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

because this file contains many blank lines

WHY ??
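
A likely explanation, sketched on toy data: comm separates its three columns with leading tabs, so `cut -f2` returns an empty second field for every line that appears in both files. The IDs and the temporary directory below are illustrative assumptions, not the real A.txt and B.txt.

```shell
# Toy data (hypothetical IDs): every ID in A.txt also occurs in B.txt.
workdir=$(mktemp -d)
printf 'EV825482.1\nEV825616.1\n' > "$workdir/A.txt"
printf 'EV825482.1\nEV825573.1\nEV825616.1\nEV825623.1\n' > "$workdir/B.txt"

# comm prints three tab-indented columns:
#   column 1 (only in A): no leading tab
#   column 2 (only in B): one leading tab
#   column 3 (in both):   two leading tabs
comm "$workdir/A.txt" "$workdir/B.txt"

# For a "both" line, field 2 lies between the two tabs and is empty,
# so cut -f2 emits one blank line per shared ID.
blank_lines=$(comm "$workdir/A.txt" "$workdir/B.txt" | cut -f2 | grep -c '^$')
echo "blank lines: $blank_lines"
```

Note also that cut passes lines containing no tab through unchanged, so any ID unique to A.txt would end up in the output as well. That does not happen here only because A is a subset of B.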


Dear Pierre Lindenbaum, Hi

When I use comm -1 fileA fileB | wc -l, it returns the number of lines in fileB (6088).

When I use comm -2 fileA fileB | wc -l, it returns the number of lines in fileA (5699).

When I use comm -3 fileA fileB | wc -l, it returns the number I want (389).

So I think "suppress column 3 (lines that appear in both files)" should be read as "the lines that are in B and not in A".

Is it correct ?

~ Thank you very much


Is it correct ?

No, the correct way to get 'present in list B but absent from list A' is the following ('-1' removes lines unique to fileA and '-3' removes lines that appear in both A and B):

comm -13 fileA fileB
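
A minimal end-to-end sketch of this command, assuming the input lists are not yet sorted. The toy IDs, the temporary directory, and the `.sorted` and `only_in_B.txt` file names are illustrative assumptions. comm expects both inputs sorted with the same collation, which is why `LC_ALL=C` is set on every step.

```shell
workdir=$(mktemp -d)
# Hypothetical, unsorted ID lists standing in for the real files.
printf 'EV825616.1\nEV825482.1\n' > "$workdir/fileA"
printf 'EV825623.1\nEV825482.1\nEV825573.1\nEV825616.1\n' > "$workdir/fileB"

# comm requires sorted input; sort both files with the same collation.
LC_ALL=C sort "$workdir/fileA" > "$workdir/fileA.sorted"
LC_ALL=C sort "$workdir/fileB" > "$workdir/fileB.sorted"

# -1 drops lines unique to fileA, -3 drops lines common to both,
# leaving only the lines unique to fileB, with no leading tabs.
LC_ALL=C comm -13 "$workdir/fileA.sorted" "$workdir/fileB.sorted" > "$workdir/only_in_B.txt"
cat "$workdir/only_in_B.txt"
```

Because only one column remains, the output needs no `cut` or `sed` clean-up.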

is it normal ?

Yes, if there are no lines unique to fileA: column 1 is then empty, so 'comm -3' also lists only the lines unique to fileB.

But again, the correct way is 'comm -13'


OK Pierre,

That was the problem (or, better to say, the "reason"): every ID in fileA (the BLAST results) also exists in fileB (the BLAST database)!

You are right as usual ;-)

~ All the best


Thanks for your time and support.

When I run "comm -3 fileA fileB | wc -l" and "comm -13 fileA fileB", both results have 389 lines.

is it normal ?


Are those results identical?


Hi,

Good question

And the answer is positive.
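
One caveat that may be worth checking: the two commands return the same IDs and the same line count, but the outputs are not necessarily byte-identical. With `-3` alone, column 2 is still printed as a second column, so each ID carries a leading tab. The sketch below uses hypothetical toy files, already sorted, and compares the two outputs with `cmp`.

```shell
workdir=$(mktemp -d)
# Hypothetical sorted ID lists; every fileA ID also occurs in fileB.
printf 'EV825482.1\nEV825616.1\n' > "$workdir/fileA"
printf 'EV825482.1\nEV825573.1\nEV825616.1\nEV825623.1\n' > "$workdir/fileB"

comm -3  "$workdir/fileA" "$workdir/fileB" > "$workdir/out_3.txt"
comm -13 "$workdir/fileA" "$workdir/fileB" > "$workdir/out_13.txt"

# Both outputs have the same number of lines...
count_3=$(wc -l < "$workdir/out_3.txt")
count_13=$(wc -l < "$workdir/out_13.txt")

# ...but the -3 output keeps the column-2 tab indent, so the files differ.
if cmp -s "$workdir/out_3.txt" "$workdir/out_13.txt"; then
    same=yes
else
    same=no
fi
echo "lines: $count_3 vs $count_13, byte-identical: $same"
```

If the list is going to be fed to another tool, the `-13` form is the safer choice because its lines contain nothing but the ID.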


Hi Farbod,

I know it's not what you're asking. But I do this exact thing in Galaxy all the time. Galaxy wraps common command line tools, but you have a nice graphical interface to work in and the results are really easy to visualise. There are a couple of text manipulation tools in there that you could do this with.

Thought I'd put a comment here just in case you're interested :-)


Dear Ando.kelli,

Hi, and thank you for your clever advice.

Would you please name some of the programs you have used for this purpose in Galaxy?


Hi Farbod,

Sorry for the slow reply, I didn't get a notification saying that you responded to my comment.

If you install a local instance of Galaxy there are many text manipulation tools that are automatically included, and many that can be downloaded.

A list of available tools can be found at the main Galaxy Toolshed: https://toolshed.g2.bx.psu.edu/

You can go to this site and browse the tools. Alternatively, you can go to https://usegalaxy.org/ and browse the list of options down the left-hand side. I think the headings most relevant to you are: Text Manipulation, Convert Formats, Join Subtract Group, and Filter and Sort.

Hope that helps.

Kelli

7.4 years ago
Benn 8.3k

Why don't you use uniq?

cat fileA fileB | sort | uniq -d

Dear b.nota, Hi and thanks

With my script, wc -l returns 389, and with your script it shows 5699 (which is the number of lines in fileA) ;-)


Sorry, you're right. I thought you wanted the overlap... My bad!
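
For comparison, a small sketch of the two uniq modes on hypothetical toy files. It assumes that no ID is repeated within a single file. `uniq -d` reports the overlap, while `uniq -u` reports IDs that occur in only one of the two files, which matches "in B but not in A" only when every ID in fileA is also in fileB.

```shell
workdir=$(mktemp -d)
# Hypothetical ID lists; fileA is a subset of fileB, no duplicates per file.
printf 'EV825482.1\nEV825616.1\n' > "$workdir/fileA"
printf 'EV825482.1\nEV825573.1\nEV825616.1\nEV825623.1\n' > "$workdir/fileB"

# uniq -d keeps lines seen more than once: the overlap of A and B.
overlap=$(cat "$workdir/fileA" "$workdir/fileB" | sort | uniq -d)

# uniq -u keeps lines seen exactly once: IDs found in only one file.
only_once=$(cat "$workdir/fileA" "$workdir/fileB" | sort | uniq -u)

echo "overlap:"
echo "$overlap"
echo "in one file only:"
echo "$only_once"
```

Unlike `comm -13`, the `uniq -u` form cannot tell which file a line came from, so it is only a shortcut for the subset case described in this thread.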


Hi,

Not at all, it was a useful script for me,

Thanks
