Biostar Beta. Not for public use.
bash and awk code
10 months ago
Sam • 110

Hello all, I'm new to bash. Could you help me solve this small technical problem? I want to write a small script that copies the miRNA name to the left of each sequence until a new name is found. The file is in CSV and tab-delimited format. Thanks

mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
,CGATAAACCTCTGCATCCAG
----------
mir172
----------
,AGAATCTTGATGATGCTGC

convert to :

----------
mir162,ATAAACCTCTGCATCCAG
----------
mir162,CGATAAACCTCTGCATCC
----------
mir162,CGATAAACCTCTGCATCCAG
----------
mir172,AGAATCTTGATGATGCTGC

I added markup to your post for increased readability. I hope the format is correct as displayed. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

10 months ago
Republic of Ireland

Does your data have those dashed lines in it? It didn't when I first looked.

Without your dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirData.txt) <(grep -v -e "^mir" MyMirData.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirData.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

With the dashed lines:

paste -d"\0" <(awk '/^mir/{mir=$0}; /,[ATGC]*/{print mir}' MyMirDataWithDashes.txt) <(grep -v -e "^mir" -v -e "--" MyMirDataWithDashes.txt)
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

...or:

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}' MyMirDataWithDashes.txt
mir162,ATAAACCTCTGCATCCAG
mir162,CGATAAACCTCTGCATCC
mir162,CGATAAACCTCTGCATCCAG
mir172,AGAATCTTGATGATGCTGC

Hey Kevin, I'm not so good at awk and I would like to improve. Could you explain this?

awk '/^mir/{mir=$0}; /,[ATGC]*/{printf mir $0 "\n"}'

Sure, my friend. AWK is a very powerful program. It was/is its own programming language.

The awk code above will be executed on each line in the input file.

The first part, /^mir/{mir=$0}, is saying: if the line begins with 'mir' (the caret ^ is an anchor that matches the beginning of a line), then assign the value of the line ($0) to a variable called 'mir'.

The second part, /,[ATGC]*/{printf mir $0 "\n"}, is saying: if the line contains a comma followed by any number and combination of A, T, G and C (indicated by ,[ATGC]*), then print the value held in the 'mir' variable followed by the entire line and a newline ('\n'). I probably could have also used ^,[ATGC]*$ here, with ^ and $ being anchors for the beginning and end of the line. The square brackets match any one of the letters inside them, and the * repeats that match zero or more times, so the letters can appear in any order.

Due to the way that this is structured, the 'mir' variable will only change when the next mir is encountered in the input file.
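
To make the two parts concrete, here is a minimal sketch that runs the same logic on a few lines from the question. It is only an illustration: the variable names input and result are made up for the example, and the second pattern is tightened to ^,[ATGC]+$ (anchored, with + for "one or more"), which is an assumption on my side and not the exact pattern used above.

```shell
# Sample lines from the question, including the dashed separators.
input='mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
mir172
----------
,AGAATCTTGATGATGCTGC'

result=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }       # name line: remember it
  /^,[ATGC]+$/ { print mir $0 }   # sequence line: prefix the remembered name
')

printf '%s\n' "$result"
```

Lines that match neither pattern (the dashes) are simply ignored, because awk only acts on a line when one of the patterns matches it.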


Hi Kevin, could you modify the first part to have the organism and mir name on the same line?


Sure, can you let me see the exact input?


input file format:

Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC

convert to:

Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

cat MyMirData.txt
Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
Organism: aau,
    ,mir182
    ,,CCTCTGCATCCAG
    ,,AACCTCTGCATCC

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'
Organism: aau, mir162, ATAAACCTCTGCATCCAG
Organism: aau, mir162, CGATAAACCTCTGCATCC
Organism: aau, mir182, CCTCTGCATCCAG
Organism: aau, mir182, AACCTCTGCATCC

That's very helpful, man! Thanks! So the two conditions are separated by ";"? What's the use of "printf" here? You could have had the same output with:

 awk '/^mir/{mir=$0};/,[ATGC]*/{print mir $0}'

I have posted the answer above, which should work with the 'Organism' included. It just involved an extra part to the awk command, but I also remove all leading whitespace from the file with sed before piping into awk.

Yes, you are quite correct regarding print and printf. I did not have to use printf.
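
As a small sketch of that point, the two forms below produce the same line for the sample data. The variable names (input, with_print, with_printf) are only for illustration. Each pattern{action} pair is one rule, and rules can be separated by a semicolon or by a newline, as here. One difference worth knowing: the first argument to printf is a format string, so in printf mir $0 "\n" any % in the data would be read as a format specifier. Passing the data as arguments to an explicit "%s%s\n" format, or just using print, avoids that.

```shell
input='mir162
,ATAAACCTCTGCATCCAG'

# print appends the newline (the output record separator) by itself
with_print=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { print mir $0 }
')

# printf needs an explicit format; the data is passed as arguments
with_printf=$(printf '%s\n' "$input" | awk '
  /^mir/       { mir = $0 }
  /^,[ATGC]+$/ { printf "%s%s\n", mir, $0 }
')

printf '%s\n' "$with_print" "$with_printf"
```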


Hi Kevin, how can I revise this code to handle "let" and "bantam" microRNAs? This code only handles mir, and I have other miRNAs such as let-7 and bantam-3p. Thanks for your help.


Hi Sam,

Assuming that everything is in the same format, this should function correctly:

sed 's/[[:space:]]\+,//' MyMirData.txt | awk '/^Organism/{varOrg=$0}; /^mir|^let|^bantam/{varMir=$0}; /^,[ATGC]*$/{print varOrg varMir $0}' | sed 's/,/, /g'

The only part that I changed is the matching pattern for awk: /^mir|^let|^bantam/ i.e. match 'mir' or 'let' or 'bantam' at the beginning of the line
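
The same alternation can also be written with a group, /^(mir|let|bantam)/, which matches the same lines. The sketch below is one possible way to try it out, not the pipeline above: it does the whitespace stripping inside awk with sub() instead of sed, and the sample input with let-7 and bantam-3p is made up in the format shown earlier in the thread.

```shell
# Hypothetical sample input in the Organism / name / sequence layout.
input='Organism: aau,
    ,let-7
    ,,ATAAACCTCTGCATCCAG
    ,bantam-3p
    ,,CGATAAACCTCTGCATCC'

result=$(printf '%s\n' "$input" | awk '
  { sub(/^[ \t]+,/, "") }                   # drop leading whitespace and the first comma
  /^Organism/         { org  = $0; next }   # remember the organism line
  /^(mir|let|bantam)/ { name = $0; next }   # remember the miRNA name, any of the three prefixes
  /^,[ATGC]+$/        { print org " " name ", " substr($0, 2) }
')

printf '%s\n' "$result"
```

With this input it prints one line per sequence, each prefixed with the organism and the most recent miRNA name.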

11 months ago
France/Nantes/Institut du Thorax - INSE…
awk '/^-/ {next;} /^,/ {printf("%s%s\n----------\n",M,$0);next;}{M=$0;}' input.txt
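
Read rule by rule, the one-liner appears to work as sketched below. This is the same program spread over several lines with comments added. The sample input comes from the question, and the variable names input and result are only for the example.

```shell
input='mir162
,ATAAACCTCTGCATCCAG
----------
,CGATAAACCTCTGCATCC
----------
mir172
----------
,AGAATCTTGATGATGCTGC'

result=$(printf '%s\n' "$input" | awk '
  /^-/ { next }                                        # dashed line: skip it
  /^,/ { printf("%s%s\n----------\n", M, $0); next }   # sequence: print name + sequence, then a separator
       { M = $0 }                                      # anything else is a name: remember it
')

printf '%s\n' "$result"
```

Because the last rule has no pattern, it catches every line that the first two rules did not handle, so it is not tied to names that start with mir.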

Could you help with this?

  Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG
    ,,CGATAAACCTCTGCATCC
convert to :
Organism: aau, mir162,,ATAAACCTCTGCATCCAG
Organism: aau, mir162,,CGATAAACCTCTGCATCC

I'm not very good at awk, but you can always do it step by step if you don't know how to do it in one command, for example something like this:

# add the miRNA name to each sequence line, keeping the Organism line
awk '/Organism/{print}; /,mir/{mir=$0}; /,,[ATGC]+$/{print mir $0}' test.data > test1.data
# add the organism to each of those lines
awk '/Organism/{org=$0}; /,mir/{print org $0}' test1.data > test2.data
# tidy the lines the way you want them: sed 's/find/replace/' finds and replaces; here it removes the extra whitespace around the commas
sed -e 's/^[[:space:]]*//' -e 's/,[[:space:]]*,mir/, mir/' -e 's/[[:space:]]*,,/,,/' test2.data > test3.data

Hey Afagh, I posted a solution in my answer (above), in case you wanted to take a look. It only takes one line.

AWK can be difficult to learn, but it is worth the time to learn it! :) It is very powerful.


Hi Kevin,

Your code works fine, but why doesn't the following work?

awk '/^Organism/{org=$0" "}; /,mir/{mir=$0}; /,[ATGC]*$/ {print org mir $0 "\n"}' test.data | sed 's/ ,/,/g'

Outputs:

Organism: aau, Organism: aau,

Organism: aau, ,mir162,,ATAAACCTCTGCATCCAG

Organism: aau, ,mir162,,CGATAAACCTCTGCATCC


You need to remove all leading whitespace with sed 's/[[:space:]]\+,//' MyMirData.txt first, and then pipe this into awk.
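
To see what that sed step does to each kind of line, here is a minimal sketch on three sample lines. One assumption to flag: \+ is a GNU extension in basic regular expressions, so the sketch spells it as [[:space:]][[:space:]]*, which should behave the same way in other sed implementations. The variable names input and stripped are only for the example.

```shell
input='Organism: aau,
    ,mir162
    ,,ATAAACCTCTGCATCCAG'

# Remove the first run of whitespace that is followed by a comma, plus that comma.
stripped=$(printf '%s\n' "$input" | sed 's/[[:space:]][[:space:]]*,//')

printf '%s\n' "$stripped"
```

After this step the name line starts with mir and the sequence line starts with a single comma, so anchored patterns such as /^mir/ and /^,[ATGC]*$/ can tell the three kinds of line apart. The Organism line is left unchanged because it contains no whitespace directly before a comma.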


Keep working with it and you'll get it! :) We love AWK


Amazing Kevin! Thanks to people like you!


Pierre is an expert at awk too, better than me I think.


you are real biostarS :)
