Question

Controlling expressions with Awk

4.0 years ago

aaa.bioinfo • 0

Hi All,

Having worked with awk for a couple of weeks, I've run into a problem: I would like to control how long the match of my expression is. Below are my example input, code, and desired output.

Input: fasta file

>gene445
ATGACACCACTATACCTACACCCCCCCTCTTCAAAATTTTCTACAATACCTTCAAAATTTTCAAAACCATTACCACGTACTCCATTTTCTCTTCAACCCCTTATAAATAAACCCATACCCTTACATAATTACAATTCACTTCAAAATTTTAAATATCATAACCATCTCTACTCTTATTCC
--
>gene450
ATGTTAAAATAACTTCGTTTATCACATCTTCAAAATTTTATGTCTTGCCTTCAAAATTTTACATCATAAACAATCAAGTCATCGACTATTTCCTACTCTTACTAAACACAATCTCTCCACCTCATACACACT
--
>gene455
ATGTTATCTATACATCATTCATTATCATTCCCTTAATTCATCAATCTTCAAAATTTTACTTTTTCTTCAAAATTTTTTCGTACTAACCCAGTACTCTTTCATATCCAAACTATCTGCATAACTTAGATCCTTCAAAATTTTTAAACAGGCACA

Code: awk '/CC*CC/{print $0}' file.fasta

What I would like to regulate is this: CC*CC. I basically would like to vary the inside of the brackets CC-[ATCG-of >20]-CC; currently I get all instances of CC-any length-CC. Additionally, I would like the gene name added to the output.

Desired output:

>gene445
CCACTATACC   

CCTACACC

CCTCTTCAAAATTTTCTACAATACC  # I would then like to go through each one and keep only those with >20 nt between the two CCs

CCTTCAAAATTTTCAAAACC

CCATTACC

CCACGTACTCC

CCATTTTCTCTTCAACC

CCTTATAAATAAACC

CCATACC

CCTTACATAATTACAATTCACTTCAAAATTTTAAATATCATAACC

CCATCTCTACTCTTATTCC

Perhaps a way to think of what I really want my script to do is the following:

1.) open the fasta file

2.) identify the gene

3.) find the first occurrence of CC

4.) find the second occurrence of CC

5.) determine the length between them

6.) if the length is greater than 20

7.) then output the gene number and the sequence

8.) the CC from step 4 becomes the CC of step 3

9.) look for the next CC occurrence

10.) repeat steps 5-9

11.) complete the gene

12.) repeat steps 2-11 for the next gene
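The steps above can be sketched in plain awk with index() and substr(), without any regular expression. This is only a rough sketch under a few assumptions: each sequence sits on a single line (as in the input shown), "length" means the number of bases between the two CCs, CCs are scanned without overlap (so how a run such as CCCCCCC is split is a guess), and the sample record and the file names sample.fasta and segments.txt are made up for illustration.

```shell
# Hypothetical one-record sample; the sequence is made up for illustration.
printf '>geneA\nGGCCAAAAATTTTTGGGGGAAAAATTTTTCCGTCCAA\n' > sample.fasta

awk -v min=20 '
  /^>/  { header = $0; next }           # step 2: remember the gene name
  /^--/ { next }                        # skip the "--" separator lines
  {
    seq  = $0
    prev = index(seq, "CC")             # step 3: first CC
    while (prev > 0) {
      rest = substr(seq, prev + 2)      # everything after the current CC
      off  = index(rest, "CC")          # steps 4 and 9: next CC
      if (off == 0) break               # step 11: no further CC in this gene
      gap = off - 1                     # step 5: bases between the two CCs
      if (gap > min) {                  # step 6
        print header "\t" substr(seq, prev, gap + 4)   # step 7: gene, CC + gap + CC
      }
      prev = prev + 2 + gap             # step 8: the second CC becomes the first
    }
  }
' sample.fasta > segments.txt
cat segments.txt
```

Each hit is written as the gene name, a tab, and the CC...CC segment, so the result can be filtered or sorted per gene afterwards. Changing `-v min=20` changes the minimum gap.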

Any help is much appreciated.

bash awk gawk Terminal • 940 views

ADD COMMENT • link updated 4.0 years ago by GenoMax 141k • written 4.0 years ago by aaa.bioinfo • 0

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLY • link 4.0 years ago by GenoMax 141k

Thank you for that @genomax, I will make sure to do this next time.

ADD REPLY • link 4.0 years ago by aaa.bioinfo • 0

I basically would like to vary the inside of the brackets CC-[ATCG-of >20]-CC;

See {n,m} in https://www.gnu.org/software/gawk/manual/html_node/Regexp-Operator-Details.html#Regexp-Operator-Details

If there is one number followed by a comma, then the preceding regexp is repeated at least n times.
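Applied to this question, the repetition has to follow a bracket expression such as [ACGT] rather than a single C, giving CC[ACGT]{20,}CC. A minimal sketch follows; it assembles the same pattern by hand so that the minimum can be passed in as a variable and so that it also runs on awk builds without interval support (gawk before 4.0, for example, needed --re-interval). The sample records and the file names sample.fasta and hits.txt are made up for illustration, and one sequence per line is assumed.

```shell
# Hypothetical sample records; the sequences are made up for illustration.
printf '>geneA\nGGCCAAAAATTTTTGGGGGAAAAATTTTTCCGG\n>geneB\nGGCCATCCGG\n' > sample.fasta

# With interval support the pattern is simply: /CC[ACGT]{20,}CC/
# Here the equivalent pattern is built from "min" copies of [ACGT].
awk -v min=20 '
  BEGIN {
    pat = "CC"
    for (i = 0; i < min; i++) pat = pat "[ACGT]"   # min mandatory bases
    pat = pat "[ACGT]*CC"                          # any further bases, then CC
  }
  /^>/     { header = $0; next }     # remember the current gene name
  $0 ~ pat { print header; print }   # report sequences with a long enough stretch
' sample.fasta > hits.txt
cat hits.txt
```

Note that [ACGT] also matches C, so the stretch between the outer CCs may itself contain further CCs; this pattern only answers whether a line has two CCs at least `min` bases apart.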

ADD REPLY • link 4.0 years ago by Pierre Lindenbaum 161k

I tried this, but the results are not what I want:

gawk '/CC{20,}CC/{print $0}' "$tmpTop" | wc -l

This seems to be very restrictive, because the {20,} applies only to the C in front of it, so it looks for C, then 20 or more Cs, then CC.

gawk '/CC?{20,}CC/{print $0}' "$tmpTop" | wc -l

This seems to be very inclusive, because it appears to look for CC, then any 20 (including further copies of CC), then CC.

Would a better solution be something along these lines?

String > read the last occurrence of NCC > go 20 bp upstream > copy the 23 bp to a file > find the next occurrence and repeat until the remaining length of the gene is less than 23 bp or no NCC is left. Then tag the list with the gene number and move on to the next gene.

What I want is a list of 23 bp sequences that I can potentially use as CRISPR targets.
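One way to read that description is as a sliding window: take every 23-base stretch of the sequence and keep it when its last two bases are CC, with N being whatever base sits just before them. The sketch below does only that, in plain awk. It assumes one sequence per line, scans left to right rather than from the last occurrence backwards, and uses a made-up sample record and the hypothetical file names sample.fasta and windows.txt; whether this layout of 20 bases plus NCC is the right one for the intended targets is not settled in this thread.

```shell
# Hypothetical one-record sample; the sequence is made up for illustration.
printf '>geneA\nAAAAATTTTTGGGGGAAAAATGCCAA\n' > sample.fasta

awk -v len=23 '
  /^>/  { header = $0; next }            # remember the current gene name
  /^--/ { next }                         # skip the "--" separator lines
  {
    n = length($0)
    for (start = 1; start + len - 1 <= n; start++) {
      win = substr($0, start, len)       # candidate window of len bases
      if (substr(win, len - 1) == "CC") {
        # last two bases are CC; the base before them is the N
        print header "\t" start "\t" win
      }
    }
  }
' sample.fasta > windows.txt
cat windows.txt
```

The output columns are gene name, 1-based start position, and the 23-base window. Sequences shorter than 23 bases produce no output, which matches the stop condition described above.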

ADD REPLY • link updated 4.0 years ago by GenoMax 141k • written 4.0 years ago by aaa.bioinfo • 0