bash and awk code
1
17 months ago
Sam • 110

Hello all, I'm new to bash. Could you help me solve this small technical problem? I want to write a small script that copies each miRNA name to the left of every sequence below it, until a new name is found. The file is in CSV and tab-delimited format. Thanks.

mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
,CGATAAACCTCTGCATCCAG
----------
mir172
----------
,AGAATCTTGATGATGCTGC


convert to :

----------
mir162,ATAAACCTCTGCATCCAG
----------
mir162,CGATAAACCTCTGCATCC
----------
mir162,CGATAAACCTCTGCATCCAG
----------
mir172,AGAATCTTGATGATGCTGC

2

I added markup to your post for increased readability. I hope the format is correct as displayed. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.

4
17 months ago
Republic of Ireland

Does your data have those dashed lines in it? It didn't when I first looked.

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirData.txt) <(grep -v -e "^mir" MyMirData.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirData.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

# With the dashed lines
paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirDataWithDashes.txt) <(grep -v -e "^mir" -v -e "--" MyMirDataWithDashes.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC


...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir$0 "\n"}' MyMirDataWithDashes.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

0

Hey Kevin, I'm not so good at awk and I would like to improve. Could you explain this?

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir$0 "\n"}'

3

Sure, my friend. AWK is a very powerful program, and it is a programming language in its own right.

The awk code above will be executed on each line in the input file.

The first part, /^mir/{mir=$0}, is saying: if the line begins with 'mir' (the caret ^ is an anchor that matches the beginning of a line), then assign the value of the line ($0) to a variable called 'mir'.
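A quick way to see this first rule on its own is to print the remembered value on every line. This is only a sketch: the five-line input is made up to resemble the data in the question, and the second rule is a debugging aid, not part of the answer.

```shell
# Made-up sample input resembling the data in the question (an assumption).
out=$(printf '%s\n' 'mir162' ',ATAAACCTCTGCATCCAG' ',CGATAAACCTCTGCATCC' 'mir172' ',AGAATCTTGATGATGCTGC' |
    awk '
        /^mir/ { mir = $0 }            # rule 1: remember the most recent name line
        { print NR ": mir=" mir }      # debugging rule: show the remembered value on every line
    ')
printf '%s\n' "$out"
```

The value stays mir162 for lines 1 to 3 and switches to mir172 on line 4, which is why every sequence line can be prefixed with the right name.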

The second part, /,[ATGC]*/{printf mir $0 "\n"}, is saying: if the line contains a comma followed by any number (including zero) and combination of A, T, G and C (indicated by ,[ATGC]*), then print the value held in the 'mir' variable followed by the entire line and a newline ('\n'). I probably could have also used ^,[ATGC]*$ here, with ^ and $ being anchors for the beginning and end of the line. The square brackets match any one of the letters inside them, and the * repeats that match, so the letters can appear in any order. Because of this structure, the 'mir' variable only changes when the next mir is encountered in the input file.

0

Hi Kevin, could you modify the first part to have the organism and mir name on the same line?

0

Sure, can you let me see the exact input?

0

Input file format:

Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC

Convert to:

Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

1

cat MyMirData.txt
Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
Organism: aau,
    ,mir182
    ,,CCTCTGCATCCAG
    ,,AACCTCTGCATCC

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'
Organism: aau, mir162, ATAAACCTCTGCATCCAG
Organism: aau, mir162, CGATAAACCTCTGCATCC
Organism: aau, mir182, CCTCTGCATCCAG
Organism: aau, mir182, AACCTCTGCATCC

0

That's very helpful, man! Thanks! So the two conditions are separated by ";"? What's the use of "printf" here? You could have had the same output with:

awk '/^mir/{mir=$0};/,[ATGC]*/{print mir $0}'

0

I have posted the answer above, which should work with the 'Organism' included. It just involved an extra part to the awk command, but I also remove the leading whitespace (along with the comma that follows it) from each line with sed before piping into awk.

Yes, you are quite correct regarding print and printf. I did not have to use printf.

0

Hi Kevin, how can I revise this code to handle "let" and "bantam" microRNAs? This code only handles mir, and I have other miRNAs such as let-7 and bantam-3p. Thanks for your help.

2

Hi Sam,

Assuming that everything is in the same format, this should function correctly:

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir|^let|^bantam/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'

The only part that I changed is the matching pattern for awk: /^mir|^let|^bantam/, i.e. match 'mir' or 'let' or 'bantam' at the beginning of the line.

3
18 months ago
France/Nantes/Institut du Thorax - INSE…

awk '/^-/ {next;} /^,/ {printf("%s%s\n----------\n",M,$0);next;}{M=$0;}' input.txt

0

Could you help with this?

Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC

Convert to:

Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

1

I'm not very good at awk, but you can always do it step by step if you don't know how to do it in one command, for example something like this:

awk '/,mir*/{mir=$0}; /,[ATGC]*$/{printf mir$0 "\n"}' test.data > test1.data
#adding the mir162 to the sequences
awk '/^Organism/{org=$0" " }; /^mir*/ {printf org$0 "\n"}' test1.data > test2.data
#adding the organism to the lines
sed 's/    ,,/,,/g' test2.data > test3.data
#modifying the lines the way you want them to be
#sed 's/find/replace/g' filename finds and replaces; for example, here it removes the extra spaces before the commas
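For comparison, the three steps above can probably be folded into a single awk pass with no intermediate files. This is a sketch, not the method from either answer: it assumes the layout shown in the request (name lines start with one comma, sequence lines with two, possibly after leading spaces or tabs), the four-space indentation in the sample is a guess, and the variable names org and mir are my own.

```shell
# Assumed input layout: four spaces before the commas (the real file may differ).
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC'

out=$(printf '%s\n' "$input" | awk '
    { sub(/^[ \t]+/, "") }                      # strip leading spaces and tabs from every line
    /^Organism/ { org = $0; next }              # remember the organism line
    /^,[^,]/    { mir = substr($0, 2); next }   # ",mir162": remember the name without its comma
    /^,,/       { print org " " mir $0 }        # ",,SEQ": emit organism, name and sequence
')
printf '%s\n' "$out"
```

Because the rules key on the number of leading commas rather than on the text "mir", the same sketch should also pick up names such as let-7 or bantam-3p, provided they sit in the same column.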

1

Hey Afagh, I posted a solution in my answer (above), in case you wanted to take a look. It only takes one line.

AWK can be difficult to learn, but it is worth the time to learn it! :) It is very powerful.

0

Hi Kevin,

Your code works fine, but why doesn't awk '/^Organism/{org=$0" "}; /,mir/{mir=$0}; /,[ATGC]*$/ {print org mir$0 "\n"}' test.data | sed 's/ ,/,/g' work?

Outputs:

Organism: aau, Organism: aau,

Organism: aau, ,mir162,,ATAAACCTCTGCATCCAG

Organism: aau, ,mir162,,CGATAAACCTCTGCATCC

1

You need to remove all leading whitespace with sed 's/[[:space:]]\+,//' MyMirData.txt first, and then pipe this into awk.
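A small check of why the order matters, as a sketch under assumptions: the sample input uses four leading spaces, which may not match the real file, and the sed expression is spelled [[:space:]][[:space:]]*, instead of the GNU-specific \+ so that it should also run on non-GNU sed.

```shell
# Assumed input layout: four spaces before the commas (the real file may differ).
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC'

# The awk program from the answer, kept in a variable so both runs use the same rules.
prog='/^Organism/{varOrg=$0}; /^mir/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}'

# Without the sed step, no line starts with "mir" or ",", so awk prints nothing.
raw=$(printf '%s\n' "$input" | awk "$prog")

# With the sed step, the indentation and the first comma are gone before awk sees the line.
clean=$(printf '%s\n' "$input" | sed 's/[[:space:]][[:space:]]*,//' | awk "$prog" | sed 's/,/, /g')

printf 'without sed: [%s]\n' "$raw"
printf 'with sed:\n%s\n' "$clean"
```

The anchored patterns /^mir/ and /^,[ATGC]*$/ only match when the name or comma is the very first character, which is what the sed step guarantees.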

1

Keep working with it and you'll get it! :) We love AWK

1

Amazing Kevin! Thanks to people like you!

0

Pierre is an expert at awk too, better than me, I think.