Hi All,
I've been working with awk for a couple of weeks, and I've run into a problem: I would like to control how long the stretch matched by my regular expression is. Below are my example input, code, and desired output.
Input: fasta file
>gene445
ATGACACCACTATACCTACACCCCCCCTCTTCAAAATTTTCTACAATACCTTCAAAATTTTCAAAACCATTACCACGTACTCCATTTTCTCTTCAACCCCTTATAAATAAACCCATACCCTTACATAATTACAATTCACTTCAAAATTTTAAATATCATAACCATCTCTACTCTTATTCC
--
>gene450
ATGTTAAAATAACTTCGTTTATCACATCTTCAAAATTTTATGTCTTGCCTTCAAAATTTTACATCATAAACAATCAAGTCATCGACTATTTCCTACTCTTACTAAACACAATCTCTCCACCTCATACACACT
--
>gene455
ATGTTATCTATACATCATTCATTATCATTCCCTTAATTCATCAATCTTCAAAATTTTACTTTTTCTTCAAAATTTTTTCGTACTAACCCAGTACTCTTTCATATCCAAACTATCTGCATAACTTAGATCCTTCAAAATTTTTAAACAGGCACA
Code: awk '/CC*CC/{print $0}' file.fasta
What I would like to control is this part: CC*CC. Basically, I would like to constrain what sits between the two CCs, i.e. CC-[more than 20 of A/T/C/G]-CC. Currently I get all instances of CC-any length-CC. Additionally, I would like the gene name added to the output.
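For reference, a small sketch of what the two spellings of this pattern do in awk, using three made-up toy lines (not taken from the FASTA file). In a regular expression, C* means "zero or more C", so CC*CC matches a run of three or more C, whereas CC.*CC reads as "CC, anything, CC". Either way, an awk pattern only selects which whole lines get printed; it does not cut out the matching part.

```shell
# Toy input (made up for illustration): one line per case.
toy='ACCCA
ACCTTCCA
ACGT'

# C* is "zero or more C", so CC*CC needs at least three C in a row.
printf '%s\n' "$toy" | awk '/CC*CC/'      # prints: ACCCA

# . is "any character", so CC.*CC is CC, anything, CC.
printf '%s\n' "$toy" | awk '/CC.*CC/'     # prints: ACCTTCCA
```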
Desired output:
>gene445
CCACTATACC
CCTACACC
CCTCTTCAAAATTTTCTACAATACC #I would then like to go through each one and keep only those with >20 nt between the two CCs
CCTTCAAAATTTTCAAAACC
CCATTACC
CCACGTACTCC
CCATTTTCTCTTCAACC
CCTTATAAATAAACC
CCATACC
CCTTACATAATTACAATTCACTTCAAAATTTTAAATATCATAACC
CCATCTCTACTCTTATTCC
Perhaps a way to think of what I really want my script to do is the following:
1.) open the fasta file
2.) identify the gene
3.) find the first occurrence of CC
4.) find the second occurrence of CC
5.) determine the length between them
6.) if the length is greater than 20
7.) then output the gene number and the sequence
8.) the CC from #4 becomes the CC for #3
9.) look for the next CC occurrence
10.) repeat steps 5-9
11.) complete the gene
12.) repeat steps 2-11
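The steps above could be sketched in awk roughly as follows. This is only one possible reading of the steps, not a tested pipeline: cc_gaps is a hypothetical helper name, the minimum gap is passed as the first argument, and the sketch assumes every sequence sits on a single line and that "--" lines are separators, as in the example input. The search for the next CC starts one base after the current one, so inside a run such as CCCCCCC the last CC of the run ends up as the starting point.

```shell
# cc_gaps is a hypothetical helper name. Usage: cc_gaps MIN [file.fasta]
# Assumes each sequence sits on a single line, as in the example input.
cc_gaps() {
  min=$1; shift
  awk -v min="$min" '
    /^>/  { gene = $0; next }        # step 2: remember the current gene header
    /^--/ { next }                   # skip the "--" separator lines
    {
      seq = toupper($0)
      shown = 0                      # header not printed yet for this gene
      first = index(seq, "CC")       # step 3: first CC
      while (first > 0) {
        # step 4: next CC, searched from one base further on so that
        # overlapping hits inside a run such as CCCC are also seen
        off = index(substr(seq, first + 1), "CC")
        if (off == 0) break          # no further CC in this gene
        gap = off - 2                # step 5: bases between the two CC (-1 if they overlap)
        if (gap > min) {             # step 6
          if (!shown) { print gene; shown = 1 }
          print substr(seq, first, gap + 4)   # step 7: CC + gap + CC
        }
        first += off                 # step 8: the second CC becomes the first
      }
    }
  ' "$@"
}

# Example call on the file from the post, keeping gaps longer than 20 nt:
# cc_gaps 20 file.fasta
```

Using index() and substr() instead of a single regular expression keeps the "no CC inside the gap" rule implicit, because each gap always ends at the very next CC.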
Any help is much appreciated.
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!
Thank you for that @genomax, I will make sure to do this next time.
See {n,m} in https://www.gnu.org/software/gawk/manual/html_node/Regexp-Operator-Details.html#Regexp-Operator-Details
I tried this, but the results are not what I want.
This seems to be very restrictive because it's looking for CC-20XC-CC
This seems to be very inclusive because it's looking for CC-[any 20, including copies of CC]-CC
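The commands that were tried are not shown in the post, so the two patterns below are only guesses at the restrictive and the inclusive form. A minimal sketch on a made-up string, using {3,} instead of {20,} to keep it short, and grep -oE rather than awk because support for {n,m} differs between awk implementations and versions:

```shell
# Made-up toy string: CC ATCA CC ATGA CC (the first gap contains a single C).
toy=CCATCACCATGACC

# Restrictive guess: no C at all allowed between the two CC,
# so the CC-ATCA-CC stretch is missed.
printf '%s\n' "$toy" | grep -oE 'CC[ATG]{3,}CC'     # prints: CCATGACC

# Inclusive guess: any base allowed in between, so the match runs
# straight over the CC in the middle.
printf '%s\n' "$toy" | grep -oE 'CC[ACGT]{3,}CC'    # prints: CCATCACCATGACC
```

Neither form says "more than 20 bases that contain no CC", which may be why a single regular expression feels either too strict or too loose here.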
Would a better solution be something along the lines of:
String > read the last occurrence of NCC > go 20 bp upstream > copy 23 bp to a file > then find the next occurrence and repeat until the remaining length of the gene is less than 23 bp or no NCC is available. Tag the list with the gene number and move on to the next gene?
What I want is a list of 23-bp sequences that I can potentially use as CRISPR targets.
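A rough awk sketch of that idea, following the description literally (20 bp upstream plus NCC, walking from the end of each gene towards its start). ncc_windows is a hypothetical helper name, the upstream length is passed as the first argument, and the sketch assumes one sequence per line with "--" separator lines, as in the example input. Whether every such window is a usable target is not something this sketch checks.

```shell
# ncc_windows is a hypothetical helper name. Usage: ncc_windows UP [file.fasta]
# UP is the number of bases to take upstream of NCC (20 in the post).
# Assumes each sequence sits on a single line, as in the example input.
ncc_windows() {
  up=$1; shift
  awk -v up="$up" '
    /^>/  { gene = substr($0, 2); next }   # gene name without the leading ">"
    /^--/ { next }                         # skip the "--" separator lines
    {
      seq = toupper($0)
      w = up + 3                           # window length: UP bases + NCC
      # walk from the last base towards the start; e is where a window would end
      for (e = length(seq); e >= w; e--)
        if (substr(seq, e - 1, 2) == "CC")             # window would end in NCC
          print gene "\t" substr(seq, e - w + 1, w)    # gene tag + window
    }
  ' "$@"
}

# Example call for 23 bp windows on the file from the post:
# ncc_windows 20 file.fasta
```

Each output line is the gene name, a tab, and one window, so the list can be sorted or filtered further with the usual tools. An NCC too close to the start of the gene to leave UP bases upstream is skipped.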