Question

Add string to list of protein sequences in fasta file with different lengths

5.7 years ago

Jason ▴ 10

How can I do this using Python or a shell command?

Suppose I have 200 protein sequences of different lengths in a FASTA file named 'proto.fasta'.

I want to find the maximum sequence length and then append the character Z to the end of every shorter sequence until all sequences are as long as the longest one. Some sequences need only one extra character to reach the maximum length, while others need many more.

After that, I want to save the result to a text or FASTA file.

This is only an example, but I want to do the same thing on my own sequence file:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDID
NGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINA
LNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRN
KLLAARIQQQQDIDNGTLPD
>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQA

The result should look like this. Since the first sequence is the longest, Z is added to all the other sequences to make them the same length as the first:

>sp|P08112|MASY_ECOLI Malate synthase B OS=Escherichia coli (strain K12) GN=aceB PE=1 SV=1
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPQRNKLLAARIQQQQDIDNGTLPDFISETASIRDADWKIRGIPADLEDRRVEITGPVERKMVINALNANVKVFMADFED
>gi|84383531|gb|AER967412.1| C OS=Escherichia coli (strain K14)
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTP
QRNKLLAARIQQQQDIDNGTLPDZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|M04142|MASY_ECOLI tra synthase D OS=Escherichia coli (strain K16) GN=aceB 
MTEQATTTDELAFTRPYGEQEKQILTAEAVEFLTELVTHFTPZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

>np|S08112|MASY_ECOLI kw synthase S OS=Escherichia coli (strain K16) GN=aceB 
MTEQAZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

sequencing • 1.6k views

ADD COMMENT • link updated 5.7 years ago by benformatics 3.9k • written 5.7 years ago by Jason ▴ 10

Can I ask why? And why the Z character specifically?

ADD REPLY • link 5.7 years ago by Joe 21k

and a "protein" of length 5 looks very dubious to me

ADD REPLY • link 5.7 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2018-08-31

from pyfaidx import Fasta, Faidx

# Create faidx of fasta file  
faidx = Faidx('test.fa')

# Get max sequence length from faidx
# ("rlen" is the sequence length; "lenc" is only the number of characters per line)
max_seq_name = max(faidx.index.keys(), key=(lambda k: faidx.index[k]["rlen"]))
max_len      = faidx.index[max_seq_name]["rlen"]


# Loop over fasta and pad each sequence with "Z" up to max_len
fa = Fasta('test.fa')

for f in fa.records:
    # long_name is the full header without the leading ">"
    print(">" + fa[f].long_name)
    seq = str(fa[f][:]).rstrip()
    print(seq + "Z" * (max_len - len(seq)))
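
If installing pyfaidx is not an option, the same idea can be sketched with the standard library only. This is a minimal sketch, not a tested pipeline: the helper names `read_fasta`, `pad_records` and `write_fasta` are made up for illustration, the 60-character line width is an assumption, and the parser assumes a plain, uncompressed FASTA file.

```python
# Hypothetical helpers: pad every sequence in a FASTA file to the longest length.

def read_fasta(path):
    """Return a list of (header, sequence) tuples; wrapped sequence lines are joined."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines between records
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pad_records(records, pad_char="Z"):
    """Right-pad each sequence with pad_char up to the longest sequence length."""
    max_len = max((len(seq) for _, seq in records), default=0)
    return [(header, seq.ljust(max_len, pad_char)) for header, seq in records]


def write_fasta(records, path, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(">" + header + "\n")
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
```

With the file from the question, something like `write_fasta(pad_records(read_fasta('proto.fasta')), 'proto_padded.fasta')` would save the padded sequences; the output file name here is only a placeholder.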

score 0 · Answer 2 · 2018-08-31

R implementation for you good sir

library(Biostrings)

## import your amino acid fasta file
fa <- readAAStringSet('your_fasta.fa')

## find and save the "width" of the entry with the longest length
max.aa.width.fa <- max(width(fa))

## find the difference between the max and each individual entry
width.diff <- max.aa.width.fa-width(fa)

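## paste/cat N number of 'Z's to each entry, where N is the difference between that entry's length and the max
fa.extended <- xscat(fa,strrep('Z',width.diff))

## copy the original headers over, in case xscat() does not carry the names along
names(fa.extended) <- names(fa)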

## export your new subset as fasta (btw default for this function is fasta)
writeXStringSet(fa.extended,'your_fasta_extended.fa')