compare two files and print unique values to a new file
3
0
Entering edit mode
8.8 years ago
HumeMarx ▴ 40

I am trying to compare two (or more) files containing chromosomal positions in the form 2:282828282828. There are about 70,000 of these positions, and whilst Excel works for smaller datasets, it is causing me too much trouble on this scale.

What I was trying to do is

1) compare each position in the file with every position in the other file, and if the value is unique, print that value to a new file

OR, if that's not possible, I can merge the two files into one column so that I can:

2) compare all cells in a merged file, and print the unique values (not their index) into a different file.

I am a little stuck as to where to start for the first option

But for the second option I am using MATLAB (as I am not sure where to even start with R) and I was thinking of this code:

a=data1.txt
[n, bin] = histc(a, unique(a)); multiple = find(n > 1);

to find the multiple values, but how do I get MATLAB to write these unique values where n=1 to a new file?

I really appreciate any help you may be able to give. I am trying to learn MATLAB and R and Plink at the same time as actually doing the work.

MATLAB R • 17k views
ADD COMMENT
2
Entering edit mode
8.8 years ago

On Linux:

comm -3 <(sort file1.bed | uniq) <(sort file2.bed | uniq)
ADD COMMENT
0
Entering edit mode

Is it possible to make it save the output to a file? It works well, but just prints to the screen.

I think I am mixing this with plink, but I wrote this -write-txt >file3 after the command in an attempt to write to file.

ADD REPLY
1
Entering edit mode

Just > file3
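
A minimal sketch of the whole pipeline with the redirect, in case it helps. The sample positions and the temporary directory are made up for illustration. It uses intermediate sorted files instead of `<( )` so it also runs in a plain POSIX shell, and it assumes you want a single-column output, so the tab that comm puts in front of lines unique to the second file is stripped.

```shell
# Hypothetical sample data: one position per line (values are made up).
workdir=$(mktemp -d)
printf '2:100\n1:500\n2:300\n' > "$workdir/file1.txt"
printf '2:300\n3:700\n1:500\n' > "$workdir/file2.txt"

# comm needs sorted input; sort -u also drops duplicates within each file.
# LC_ALL=C keeps sort and comm on the same collation rules.
LC_ALL=C sort -u "$workdir/file1.txt" > "$workdir/file1.sorted"
LC_ALL=C sort -u "$workdir/file2.txt" > "$workdir/file2.sorted"

# -3 hides lines common to both files. Lines unique to the second file
# are indented with a tab, so tr removes it to leave a single column.
LC_ALL=C comm -3 "$workdir/file1.sorted" "$workdir/file2.sorted" \
    | tr -d '\t' > "$workdir/file3"

cat "$workdir/file3"
```

With this sample data, file3 ends up containing 2:100 and 3:700, one per line.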

ADD REPLY
0
Entering edit mode

Thanks

ADD REPLY
1
Entering edit mode
8.8 years ago

If you have one chromosomal position per line, you can do it in a few lines of R.

file1<-readLines("file1.txt") ## Read the lines in the file
file2<-readLines("file2.txt")
uniqueData<-unique(c(file1,file2)) ## Merge and get the unique ones
write(uniqueData,file="newFile3.txt") ## Write to a new file
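
One thing to keep in mind: unique() keeps one copy of every position, including those present in both files. If what is wanted is only the positions that occur once in the merged data, that is a different result. A rough command-line sketch of the difference, using made-up sample positions and assuming neither file contains internal duplicates:

```shell
# Hypothetical sample data: one position per line (values are made up).
workdir=$(mktemp -d)
printf '2:100\n1:500\n2:300\n' > "$workdir/file1.txt"
printf '2:300\n3:700\n1:500\n' > "$workdir/file2.txt"

# Union: every position once (same set as unique(c(file1, file2)) in R,
# but in sorted order rather than order of first appearance).
cat "$workdir/file1.txt" "$workdir/file2.txt" \
    | LC_ALL=C sort -u > "$workdir/union.txt"

# Singletons: only positions that occur exactly once in the merged data.
# uniq compares adjacent lines, so the input has to be sorted first.
cat "$workdir/file1.txt" "$workdir/file2.txt" \
    | LC_ALL=C sort | uniq -u > "$workdir/once.txt"

echo "union:";  cat "$workdir/union.txt"
echo "once:";   cat "$workdir/once.txt"
```

Here union.txt holds four positions (1:500, 2:100, 2:300, 3:700), while once.txt holds only 2:100 and 3:700.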
ADD COMMENT
0
Entering edit mode

I tried this, thanks.

But for some reason it says

Error in cat(list(...), file, sep, fill, labels, append) :
 argument 1 (type 'list') cannot be handled by 'cat'.

I tried write.table(uniqueData,file="newFile3.txt") and tried to view it but I get this

Error in View : object 'newFile3' not found

any idea why the write(unique..... ) command is not working?

ADD REPLY
0
Entering edit mode

It works perfectly fine when you are comparing the lines in two files. If you are adding the above lines somewhere in the middle of your own script, it's easier to help if you post your script.

From the first error message, it looks like you are trying to write a list to a file; write() passes its input to cat(), which cannot handle a list. To use write.table(), the object should be either a matrix or a data frame. From the second error message, I understand that you forgot the quotes around your filename.

If that does not help, post your code and I can take a look.

ADD REPLY
0
Entering edit mode

I am sure it's something obviously wrong, but I just don't know how to pick it up yet.

I assumed that because I have already loaded the files there is no need to "read" the files again.

So I ran the command:

uniqueData<-unique(c(data1c,data 3file2))
###and this works as I am getting an output called uniqueData
###but when I try to write it to: 
write(uniqueData,file="newFile3.txt")

I get this error message:

Error in cat(list(...), file, sep, fill, labels, append) :
      argument 1 (type 'list') cannot be handled by 'cat'
ADD REPLY
0
Entering edit mode

and when I try to "read" the files in I get this:

file1<-readLines("test1c.txt")
Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
ADD REPLY
1
Entering edit mode

What is your working directory? You have to give the complete file path if your file is not in your working directory.

ADD REPLY
1
Entering edit mode
8.8 years ago
5heikki 11k

The *nix command line is very efficient for this sort of stuff.

COMM(1)                          User Commands                         COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.
       With no options, produce three-column output.  Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
       -1     suppress column 1 (lines unique to FILE1)
       -2     suppress column 2 (lines unique to FILE2)
       -3     suppress column 3 (lines that appear in both files)
       --check-order
              check that the input is correctly sorted, even if all input lines are pairable
       --nocheck-order
              do not check that the input is correctly sorted
       --output-delimiter=STR
              separate columns with STR
       --help display this help and exit
       --version
              output version information and exit
       Note, comparisons honor the rules specified by 'LC_COLLATE'.

EXAMPLES
       comm -12 file1 file2
              Print only lines present in both file1 and file2.
       comm -3 file1 file2
              Print lines in file1 not in file2, and vice versa.

AUTHOR
       Written by Richard M. Stallman and David MacKenzie.

REPORTING BUGS
       Report comm bugs to bug-coreutils@gnu.org
       GNU coreutils home page: <http://www.gnu.org/software/coreutils/>
       General help using GNU software: <http://www.gnu.org/gethelp/>
       Report comm translation bugs to <http://translationproject.org/team/>

COPYRIGHT
       Copyright © 2010 Free Software Foundation, Inc.  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.  There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       join(1), uniq(1)
       The full documentation for comm is maintained as a Texinfo manual.  If the info and comm programs are properly installed at your site, the command
              info coreutils 'comm invocation'
       should give you access to the complete manual.
GNU coreutils 8.4                 March 2014                           COMM(1)
(END)
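
To make the column options above concrete, here is a small sketch with two made-up, already sorted files. The file names and contents are only for illustration.

```shell
# Hypothetical, already sorted input files.
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/file1"
printf 'b\nc\nd\n' > "$workdir/file2"

# Suppress columns 2 and 3: lines only in file1.
only1=$(comm -23 "$workdir/file1" "$workdir/file2")

# Suppress columns 1 and 3: lines only in file2.
only2=$(comm -13 "$workdir/file1" "$workdir/file2")

# Suppress columns 1 and 2: lines present in both files.
both=$(comm -12 "$workdir/file1" "$workdir/file2")

echo "only in file1: $only1"
echo "only in file2: $only2"
echo "in both: $(echo "$both" | tr '\n' ' ')"
```

When only one column is left, comm prints it without any leading tabs, so the output of each command can be redirected straight to a file.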
ADD COMMENT
0
Entering edit mode

Thanks, will try this.

ADD REPLY
