Question

How to retrieve specific rows from a data matrix based on Affymetrix ID in Linux

1


9.2 years ago

Mo ▴ 920

Hello,

I have a matrix from which I am trying to retrieve specific rows and save them to a text file.

An example matrix is

Affy ID       DDM1       DGKI2      FDGYYY1     GUHIL6
1438_at       0.0635     0.2065     -0.2112     0.0856
1487_at       0.071      -0.1315    0.0263      0.0198
1494_f_at     0.0045     -0.0237    0.0156      -0.1352
1598_g_at     -0.0541    0.0006     -0.1369     -0.0589
160020_at     -0.0925    0.2182     -0.1967     -0.0074
1729_at       -0.0017    -0.2209    -0.086      -0.0709
1773_at       -0.0273    -0.0181    0.1042      0.0136
177_at        -0.0276    -0.2563    0.3975      -0.0535
179_at        -0.0472    0.0979     -0.216      -0.2814
1861_at       0.0121     -0.4038    0.0016      0.0334
200000_s_at   -0.1021    -0.0887    -0.0452     0.0035

Let's say the matrix is named M.txt and the rows to select are listed in a file named mSelected.txt (containing 1438_at, 1729_at and 200000_s_at). My output should look like the following file:

Affy ID       DDM1       DGKI2      FDGYYY1     GUHIL6
1438_at       0.0635     0.2065     -0.2112     0.0856
1729_at       -0.0017    -0.2209    -0.086      -0.0709
200000_s_at   -0.1021    -0.0887    -0.0452     0.0035

Is there also any way to convert the Affy IDs to gene names?

$ head /Users/Desktop/mSelected.txt | cat -vet
1438_at^M1729_at^M200000_s_at 

$ head /Users/Desktop/m.txt | cat -vet
Affy ID                             ^I DDM1               ^IDGKI2                ^IFDGYYY1       ^I GUHIL6^M1438_at^I0.0635^I0.2065^I-0.2112^I0.0856^M1487_at^I0.071^I-0.1315^I0.0263^I0.0198^M1494_f_at^I0.0045^I-0.0237^I0.0156^I-0.1352^M1598_g_at^I-0.0541^I0.0006^I-0.1369^I-0.0589^M160020_at^I-0.0925^I0.2182^I-0.1967^I-0.0074^M1729_at^I-0.0017^I-0.2209^I-0.086^I-0.0709^M1773_at^I-0.0273^I-0.0181^I0.1042^I0.0136^M177_at^I-0.0276^I-0.2563^I0.3975^I-0.0535^M179_at^I-0.0472^I0.0979^I-0.216^I-0.2814^M1861_at^I0.0121^I-0.4038^I0.0016^I0.0334^M200000_s_at^I-0.1021^I-0.0887^I-0.0452^I0.0035

affymetrix linux matrix • 3.5k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


9.2 years ago

Devon Ryan 104k

While you can just use grep to get the lines from the file you want, to convert the probe IDs to something useful, you'll want to (1) load the file into R, (2) install & load the appropriate annotation file from bioconductor and (3) use the select() function followed by either merge() or left_join() from dplyr to actually annotate things.

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

0


Thanks for your message, I am trying to check them out!

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


9.2 years ago

Ram 43k

grep -f mSelected.txt M.txt >output.txt

Note: This is not the working solution. It's a lead. I should probably just have said "use grep -w -f"

ADD COMMENT • link 2.0 years ago by Ram 43k

2


Potential problems with this one if column 1 values can be found elsewhere. Also, I would include -w in case some ids are sub-strings of other ids..
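A minimal sketch of what -w changes, using small made-up files (the *_demo.txt names and the 11438_at row are hypothetical, not from the original data). -F is added so the IDs are treated as fixed strings rather than regular expressions. Note that this still matches an ID anywhere on the line, not only in column 1.

```shell
# Toy stand-ins for M.txt and mSelected.txt (hypothetical demo data)
printf 'Affy ID\tDDM1\n1438_at\t0.0635\n11438_at\t0.5\n1729_at\t-0.0017\n' > grep_M_demo.txt
printf '1438_at\n1729_at\n' > grep_ids_demo.txt

# Without -w, 1438_at also matches inside 11438_at
grep -F -f grep_ids_demo.txt grep_M_demo.txt > grep_loose_demo.txt

# With -w, only whole-word matches are kept
grep -w -F -f grep_ids_demo.txt grep_M_demo.txt > grep_word_demo.txt

cat grep_word_demo.txt
```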

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

0


If I had a nickel for every time I've mucked something up by not using -w with grep...

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

0


I rarely use grep when the match is a \bPATTERN\b. For some reason, rarely do I get some perfect scenarios to work with.

ADD REPLY • link 2.0 years ago by Ram 43k

0


Thanks, but with this line I only get an empty output.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


You'll have to tweak it with -w and such modifiers. You're better off using cat and xargs with the pattern ^<line>\b. This one liner was just supposed to be a lead to a solution, not the solution itself.
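One way to sketch the anchored-pattern idea without cat and xargs is to turn each ID into a pattern of the form ^ID followed by whitespace and hand the result to grep -f. This swaps \b for the POSIX class [[:space:]], assumes the IDs contain no regex metacharacters, and uses hypothetical demo files.

```shell
# Toy matrix: the ID 1438_at also appears as a prefix and in another column
printf 'Affy ID\tDDM1\n1438_at\t0.0635\n1438_at_x\t0.9\n1729_at\t1438_at\n' > anch_M_demo.txt
printf '1438_at\n' > anch_ids_demo.txt

# Turn every ID into a pattern anchored at the start of the line
sed 's/.*/^&[[:space:]]/' anch_ids_demo.txt > anch_patterns_demo.txt

# Only rows whose first column is exactly a listed ID survive
grep -f anch_patterns_demo.txt anch_M_demo.txt > anch_out_demo.txt
cat anch_out_demo.txt
```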

ADD REPLY • link 2.0 years ago by Ram 43k

Ram · Accepted Answer · 2015-02-02

2


9.2 years ago

5heikki 11k

awk 'FNR==NR{a[$0];next}($1 in a)' mSelected.txt M.txt
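For readers new to the two-file awk idiom: FNR==NR is true only while the first file is being read, so the IDs are stored as array keys, and then each row of the second file is printed when its first field is one of those keys. The sketch below is a commented variant on hypothetical demo files. It adds two things that are assumptions, not part of the one-liner above: a tab field separator and a rule that keeps the header line.

```shell
# Toy stand-ins for M.txt and mSelected.txt (hypothetical demo data)
printf 'Affy ID\tDDM1\tDGKI2\n1438_at\t0.0635\t0.2065\n1487_at\t0.071\t-0.1315\n1729_at\t-0.0017\t-0.2209\n' > awk_M_demo.txt
printf '1438_at\n1729_at\n' > awk_ids_demo.txt

awk -F'\t' '
  FNR == NR { wanted[$0]; next }   # first file: remember each ID as a key
  FNR == 1 || ($1 in wanted)       # second file: print header or listed rows
' awk_ids_demo.txt awk_M_demo.txt > awk_out_demo.txt

cat awk_out_demo.txt
```

The ID list must come first on the command line, and it should not be empty, otherwise FNR==NR would also hold for the matrix file.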

If the files are huge, join could be the fastest way:

 join -t "<TAB>*" -1 1 -2 1 -o 1.1,1.2,1.3,1.4,1.5 <(sort -k1,1 M.txt) <(sort -k1,1 mSelected.txt) > output

*literal tab = ctrl-v-tab
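A small sketch of the join approach on hypothetical demo files. It assumes the header has already been set aside (otherwise it gets sorted in with the data), stores the tab in a variable instead of typing a literal one, and forces the C locale so that sort and join agree on ordering.

```shell
# Toy matrix body (no header) and an unsorted ID list (hypothetical demo data)
printf '1438_at\t0.0635\n1487_at\t0.071\n1729_at\t-0.0017\n' > join_M_demo.txt
printf '1729_at\n1438_at\n' > join_ids_demo.txt

TAB=$(printf '\t')

# join needs both inputs sorted on the join field
LC_ALL=C sort -t "$TAB" -k1,1 join_M_demo.txt > join_M_sorted_demo.txt
LC_ALL=C sort join_ids_demo.txt > join_ids_sorted_demo.txt

# Keep the matrix rows whose first field appears in the ID list
LC_ALL=C join -t "$TAB" join_M_sorted_demo.txt join_ids_sorted_demo.txt > join_out_demo.txt
cat join_out_demo.txt
```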

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

1


Thanks. Of course the file is huge, and the awk command runs fine, but I don't see any output! Do you have any idea where the output is saved? Note that I first use cat on both files as follows:

cat mSelected.txt M.txt

then I run your awk line

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


You don't wanna use cat when the awk command is designed to read from the files. And output is stored in the file that follows the > operator in any UNIX command.

ADD REPLY • link 2.0 years ago by Ram 43k

1


I used the following command but the output is empty.

awk 'FNR==NR{a[$0];next}($1 in a)' /User/Mohammad/Desktop/mSelected.txt /User/Mohammad/Desktop/M.txt > output.txt

Do you know where the problem could be?

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


Could you give us the output of:

head /User/Mohammad/Desktop/mSelected.txt | cat -A

and

head /User/Mohammad/Desktop/M.txt | cat -A

ADD REPLY • link 2.0 years ago by Ram 43k

0


How about

head /User/Mohammad/Desktop/mSelected.txt | cat -vet
head /User/Mohammad/Desktop/M.txt | cat -vet
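As a rough guide to reading that output: ^I is a tab, ^M is a carriage return (CR), and $ marks a line feed (LF) at the end of a line. The sketch below builds a hypothetical CR-only file and counts both characters. A file with many ^M and no $ has no LF line endings at all, so line-based tools see it as one long line.

```shell
# Hypothetical demo file that uses CR as its only line terminator
printf 'a\tb\rc\td\r' > cr_demo.txt

# Shows ^I for tabs and ^M for CRs; no $ appears because there is no LF
cat -vet cr_demo.txt; echo

# Count the two kinds of line terminators directly
cr_count=$(tr -cd '\r' < cr_demo.txt | wc -c | tr -d ' ')
lf_count=$(tr -cd '\n' < cr_demo.txt | wc -c | tr -d ' ')
echo "CR: $cr_count  LF: $lf_count"
```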

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

1


Is that a Mac cat? I knew there was a reason I always used cat -te and not cat -A

ADD REPLY • link 2.0 years ago by Ram 43k

2


brew install coreutils makes life with Mac OS X so much easier..

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

1


Yep. That's the reason I don't remember BSD specific syntaxes. I have homebrew managing GNU-coreutils with bash on my Mac.

ADD REPLY • link 2.0 years ago by Ram 43k

0


I still cannot reply to your comment.

I do use TextWrangler but my real data is HUGE and I am afraid I won't be able to paste or even open it. I am searching for a way to make it right

I used

sed 's/^M$//' M.txt > MCorrected.txt
sed 's/^M$//' mSelected.txt > mSelectedCorrected.txt

then I used

awk 'FNR==NR{a[$0];next}($1 in a)' /User/Mohammad/Desktop/mSelectedCorrected.txt /User/Mohammad/Desktop/MCorrected.txt > output.txt

The same as before, I get an empty output

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

1


You can use pastebin or GitHub gist

ADD REPLY • link 2.0 years ago by Ram 43k

1


You'd need to use extended sed with the regular expression. Also, your regex cannot be ^M$ if you wish to match a ^M character at the end of the line, because ^ is a meta for the beginning of the line. Use sed -e 's/\r//' input >output

Also, you should not need to open the entire file to just pick the top 10 lines.

ADD REPLY • link 2.0 years ago by Ram 43k

0


It did not work, RamRS; I am still getting an empty output.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

1


Just post the whole file somewhere (dropbox, google drive, etc.) so someone can just directly determine what you need to do to clean it. The amount of effort that others have put into doing this remotely is a bit excessive.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

1


I found where the problem was and I solved it! Thank you very much

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Mo ▴ 920

0


I wish this had been heeded by OP: How to retrieve specific rows from a data matrix based on Affymetrix ID in Linux

ADD REPLY • link 2.0 years ago by Ram 43k

0


I'd missed that in the deluge of comments :P

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

0


I think everyone did. I really think we need to add instructions on Add Comment/Answer/Post to use Gist or Pastebin to add files that are ~~too large~~ not optimal for the viewer here. Plus, they also have better formatting, syntax highlighting and line numbering, so I'd prefer that any day over pasting long code here.

ADD REPLY • link 2.0 years ago by Ram 43k

0


Did you try to convert the files by any other way? You could e.g. try the "tr method" from my link. You are doing science, show some initiative.. Then when you do the head command on the corrected M.txt, it should look like:

value^Ivalue^Ivalue^Ivalue^Ivalue$
value^Ivalue^Ivalue^Ivalue^Ivalue$
value^Ivalue^Ivalue^Ivalue^Ivalue$

While the corrected mSelected.txt should look like:

value$
value$

And then if the awk command still returns empty then it means that the IDs from mSelected.txt do not exist in the first column of M.txt.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

0


Your files do not appear to have LF end of line markers. So here, e.g. awk sees only one line in your mSelected.txt file that contains all three patterns, when they should be on separate lines. Here are some ways for converting your files, or if you have e.g. TextWrangler installed you can do it there..

sed 's/^M$//' M.txt > MCorrected.txt
sed 's/^M$//' mSelected.txt > mSelectedCorrected.txt

And then use the corrected files for the awk command. Should work..
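If the files really contain only CR characters and no LF, as the cat -vet output in the question suggests, then deleting the CRs would glue every row into a single line. A sketch of the tr alternative, which turns each CR into an LF, is shown here on hypothetical demo files. For files with Windows-style CR+LF endings, deleting the CR with tr -d '\r' would be the analogous step.

```shell
# Hypothetical demo files with CR-only line endings, like the ones in the question
printf 'Affy ID\tDDM1\r1438_at\t0.0635\r1487_at\t0.071\r1729_at\t-0.0017' > fix_M_cr_demo.txt
printf '1438_at\r1729_at' > fix_ids_cr_demo.txt

# Replace every CR with an LF so each record sits on its own line
tr '\r' '\n' < fix_M_cr_demo.txt > fix_M_lf_demo.txt
tr '\r' '\n' < fix_ids_cr_demo.txt > fix_ids_lf_demo.txt

# The awk filter from the answer now sees separate lines and finds the rows
awk 'FNR==NR{a[$0];next}($1 in a)' fix_ids_lf_demo.txt fix_M_lf_demo.txt > fix_out_demo.txt
cat fix_out_demo.txt
```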

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by 5heikki 11k

0


~~That might be my mistake (during moderation copy-paste). I'll change that now.~~

Nope, OP's file looks like it has ^M characters plus a mix of tabs and spaces. Will def need a bit of cleaning.

ADD REPLY • link 2.0 years ago by Ram 43k

0


I don't think you can replace ^M with a literal ^M. You might have to use \r

ADD REPLY • link 2.0 years ago by Ram 43k