Question: How to extract upper/lowercase from fasta to output position and +/-

I'm looking for help on how to extract uppercase/lowercase information from a multi-FASTA file. I have been using the 'SNPable regions' method (http://lh3lh3.users.sourceforge.net/snpable.shtml) to mask a non-reference genome, so my data currently look like this:

>Contig1
TCCcaccgcaacgaactg
>Contig2
ggTCgtgaagtaaaaaaa
>etc...

I want to compare this to another masking approach and need to reformat it. For a simple comparison, I'd like it formatted as contig / position / + for uppercase or - for lowercase, i.e.:

Contig1 1 +
Contig1 2 +
Contig1 3 +
Contig1 4 -
Contig1 5 -
.......
.......
Contig2 1 -
Contig2 2 -
Contig2 3 +
Contig2 4 +
Contig2 5 -
....
....

Any simple fix?


A quick and simple solution is to use awk on the FASTA file: split each sequence line into an array of characters and check with a regex whether each character is lowercase:

awk '{if(NR%2==1){id=substr($1,2)};if(NR%2==0){n=split($0,a,""); for(i=1;i<=n;i++){if(match(a[i],/[a-z]/)){x="-"}else{x="+"}; printf "%s\t%d\t%s\n" , id,i,x}}}' test.fa

This will result in:

Contig1 1       +
Contig1 2       +
Contig1 3       +
Contig1 4       -
Contig1 5       -
...
Contig1 16      -
Contig1 17      -
Contig1 18      -
Contig2 1       -
Contig2 2       -
Contig2 3       +
Contig2 4       +
Contig2 5       -
Contig2 6       -
...

Note that this only works if your FASTA file is structured like the one in your example, i.e. each record is one header line followed by exactly one sequence line.

[Edit:] The script had an off-by-one error (i=0 instead of i=1).

3.8 years ago, michael.ante

Thanks a lot Michael, very elegant solution :)

Ah, you are right about the structure of the FASTA file; I oversimplified it. It is actually a multi-FASTA file with a line break every 60 nt, so your quick fix only works for the first sequence line following each >Contig header. It is a very standard file, like so:

>Contig1
ggaccagagaggttctcaccttcagtgcggcgatgaagttgtgTCCCGtCatcccgtcac
cagacatccacccgttgctgcgccaatcacagaactccttaaagccagttccctgatatg
acgccaaaaacttggcttctcgggctgctgcccgcgcctttcttgaagcgttcaacccgg
>Contig2
aacgcgtgttcgttgctgctgttgggttatgcaGTTTTGACCGTGGCGCAAAATACAAGA
AGCATAGCGCAAAGTGACGTTATTTAGCGATCAGTGAACACGCGAGCATTGACTAACGGA
AAAAGGGAAAAAGCATACGTACTGCTAACGCAGGCGCTCAGCCTGACGAAGGCGACGCGT
>etc
...

Could a small change to your solution handle this? Otherwise, should I remove all line breaks within the sequences and rerun it?

3.8 years ago, MJS
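
One way to handle wrapped sequence lines directly is to stop relying on odd/even line numbers and instead keep a running position counter that resets at each header. The following is only a sketch under that assumption, not the answerer's script: the function name fasta_case_positions is hypothetical, and, like the one-liner above, it treats every character that is not a-z (including N) as "+".

```shell
# Hypothetical helper: print "contig <TAB> position <TAB> +/-" for each base.
# Reads the FASTA files given as arguments, or stdin if none are given.
fasta_case_positions() {
  awk '
    /^>/ {                      # header line: a new record starts here
      id = substr($1, 2)        # drop the leading ">" and keep the first word
      pos = 0                   # restart the position counter for this contig
      next
    }
    {                           # sequence line (may be one of many per record)
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)    # substr avoids the non-POSIX split($0, a, "")
        pos++
        printf "%s\t%d\t%s\n", id, pos, (c ~ /[a-z]/ ? "-" : "+")
      }
    }
  ' "$@"
}
```

It could be called as fasta_case_positions test.fa > positions.tsv. Because the counter carries over from one sequence line to the next, the line width (60 nt or otherwise) should not matter.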

Search for "linearize fasta" and pipe the output of that to michael.ante's solution.

3.8 years ago, 5heikki
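
For reference, "linearizing" here means joining all sequence lines of a record into a single line, so that every header is followed by exactly one sequence line. A minimal awk sketch of that idea is shown below; the function name linearize_fasta is hypothetical, and this is just one of several common ways to do it.

```shell
# Hypothetical helper: join wrapped sequence lines so each record becomes
# one header line followed by one sequence line.
linearize_fasta() {
  awk '
    /^>/ {
      if (seq_started) printf "\n"   # end the previous sequence line
      print                          # emit the header unchanged
      seq_started = 0
      next
    }
    {
      printf "%s", $0                # append sequence text without a newline
      seq_started = 1
    }
    END { if (seq_started) printf "\n" }   # end the last sequence line
  ' "$@"
}
```

A possible use is linearize_fasta test.fa > test.linear.fa, after which the awk one-liner from the answer can be run on test.linear.fa, since its odd/even line logic then holds again.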

Correct, I'd go for the FASTA Formatter from the FASTX-Toolkit.

3.8 years ago, michael.ante

Thank you for the help.

3.8 years ago, MJS
