How to write code in Linux to manipulate a large file with grep/awk
5.2 years ago
OAJn8634 ▴ 60

Following a previous discussion (https://www.biostars.org/p/171557/), I would like to prepare a file for converting chr:pos to rs. For this, I have downloaded a list of all SNPs from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables). It looks like this:

#chrom  chromEnd    name
chr1    100663297   rs1235665665
chr1    1048577     rs1346354302
chr1    62914560    rs538775156

I would like it to look like this:

1:100663297 rs1235665665
1:1048577   rs1346354302
1:62914560  rs538775156

I could do it in R; however, the file is so large that it crashes my computer before it even loads. I was told that Linux commands such as grep and awk are well suited to such things and can handle very large files efficiently. Unfortunately, I do not have the slightest idea how to even begin writing the code in Linux to achieve my goal. Could you please help me with this? Thank you very much.

PS I am unsure whether the title of my question describes my problem effectively. Please let me know if it does not and I will edit it.

linux R rs chr awk • 1.2k views

This is a really basic question. You should start by searching the web for how awk works. Have a look at the examples on this tutorial site.

Of course I could give you the solution. But then you will not learn that much :)


Edit: Following finswimmer's comment, here are some other links.

Hello OAJn8634

Please take a look at the man pages for awk, sed, grep, and cut to see how to use them. Give us your best attempt and we will see how to modify it to fit your needs.

Awk in Bioinformatics

https://stackoverflow.com/questions/29275971/need-to-remove-the-string-chr-and-the-sign-from-the-file


Hello Bastien Hervé and finswimmer, Thank you very much for the very useful links to get me started, and for offering to help. I really appreciate it.

5.2 years ago

One way:

$ tail -n+2 in.txt | sed 's/^chr//' | awk -v OFS="\t" '{ print $1":"$2, $3 }' > out.txt

Test it out, first:

$ head -10 in.txt | tail -n+2 | ... > out.txt

Break things down by seeing how a snippet of your file looks after each step.
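
A sketch of what that step-by-step check might look like. It assumes a small hypothetical sample file, sample.txt, built from the three rows shown in the question; the file names here are placeholders, and the real input would be in.txt.

```shell
# Build a tiny tab-separated sample (hypothetical file name) from the rows in the question.
printf '#chrom\tchromEnd\tname\n'        >  sample.txt
printf 'chr1\t100663297\trs1235665665\n' >> sample.txt
printf 'chr1\t1048577\trs1346354302\n'   >> sample.txt
printf 'chr1\t62914560\trs538775156\n'   >> sample.txt

# Step 1: drop the header line (start printing at line 2).
tail -n +2 sample.txt

# Step 2: additionally strip the leading "chr" from each line.
tail -n +2 sample.txt | sed 's/^chr//'

# Step 3: additionally join columns 1 and 2 with ":" and keep column 3 after a tab.
tail -n +2 sample.txt | sed 's/^chr//' | awk -v OFS="\t" '{ print $1":"$2, $3 }' > sample_out.txt
cat sample_out.txt
```

Running each step on its own shows what that stage of the pipeline contributes before the whole thing is pointed at the full file.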


Alternatively:

sed -e '1,1d' -e 's/^chr//' in > out

Or e.g.

awk 'BEGIN{OFS=FS="\t"}NR>1{sub(/^chr/,"",$1);print $0}' in > out

Edit: I didn't notice the ":" requirement. These don't actually do that.

Corrected:

sed -e '1,1d' -e 's/^chr//' -e 's/\t/:/' in > out

awk 'BEGIN{OFS=FS="\t"}NR>1{sub(/^chr/,"",$1);print $1":"$2,$3}' in > out
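
For reference, a commented sketch of an awk-only approach that produces the chr:pos form asked for in the question. The file names snps.txt and snps_out.txt are placeholders, and the input is assumed to be tab-separated with a single header line.

```shell
# Placeholder input with the layout shown in the question (tab-separated, one header line).
printf '#chrom\tchromEnd\tname\nchr1\t100663297\trs1235665665\nchr1\t1048577\trs1346354302\n' > snps.txt

awk 'BEGIN { FS = OFS = "\t" }      # read and write tab-separated fields
     NR > 1 {                       # skip the header (line 1)
         sub(/^chr/, "", $1)        # remove a leading "chr" from column 1
         print $1 ":" $2, $3        # "1:100663297", a tab, then the rs ID
     }' snps.txt > snps_out.txt

cat snps_out.txt
```

In awk, placing strings next to each other concatenates them, while a comma in print inserts the output field separator, so `$1 ":" $2, $3` yields the joined position followed by a tab and the ID. awk processes the input one line at a time, which is why this works on files too large to load into R.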

Thank you very much 5heikki for your help.


Dear Alex, Thank you so much for your help. It has worked like magic! Thank you


Let's see what you have learned: Please describe what this command line is doing. :)
