Question

split fasta of a protein sequence into consecutive n amino acids

2

Entering edit mode

6.2 years ago

arraychip ▴ 30

let's say I have a fasta of a protein sequence

> albumin

MKWVTFISLL FLFSSAYSRG ... ... ...

I want to split the sequence into all possible consecutive 8 amino acids (only in 1 direction, amino -> carboxyl) (And no looping(I don't know if it is the right expression), say, GMKWVTFIS)

I need

> fasta.albumin1

MKWVTFIS

>fasta.albumin2

KWVTFISL

> fasta.albumin3

WVTFISLLF

...


> fasta.albumin13

FSSAYSRG

And, I want to do this for all known human protein sequences.

How would I do it??? I need the result as a fasta or fasta files. And IDs for resulting 8-mer seuqeunces need to be unique.

fasta • 3.4k views

ADD COMMENT • link updated 6.2 years ago by Pierre Lindenbaum 161k • written 6.2 years ago by arraychip ▴ 30

1

Entering edit mode

Have you tried to do this yourself? Please post your code you have already tried.

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

1

Entering edit mode

I am a biologist. I can google to search for news, but can't, and don't know any scripting. I was even struggling with my posting questions. Sorry

ADD REPLY • link 6.2 years ago by arraychip ▴ 30

0

Entering edit mode

That is fine. This forum is there to help with that. I have formatted your post yet again. Please do not change the formatting back to plain text.

ADD REPLY • link 6.2 years ago by GenoMax 141k

1

Entering edit mode

You say 5 AA in title but 8 in the body of post. Which is it?

There are ~20K+ validated human proteins at UniProt so results file is going to be gigantic.

Note: Why do you keep changing back the formatting of the data in your post to plain text?

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Entering edit mode

Yes, it's going to be a 'big data' I really want try.

ADD REPLY • link 6.2 years ago by arraychip ▴ 30

0

Entering edit mode

Hey @genomax

There is some issue with the code formatter below. A '(' is missing in my code print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer)) below but looks okay when I check in the edit mode.

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

0

Entering edit mode

There is a known issue with python code formatting and biostars display of that code. I think the fix is to put a comment there to add the ( at proper place. Not elegant but will work for now.

Edit: Best solution is to put your code in a gist and then link that in your post. Biostars then looks the code up via gist and formats it correctly.

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Entering edit mode

I see! I will opt for the gist option next time. Thanks

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

0

Entering edit mode

If you know python/perl then this should be easy to do. Someone is bound to put a program up as answer shortly. @Pierre will likely have a one liner in awk.

ADD REPLY • link 6.2 years ago by GenoMax 141k

score 5 · Accepted Answer · 2018-02-08

5

Entering edit mode

6.2 years ago

shenwei356 8.4k

There are many methods. For long terms, some programming skill is worth learning.

Here's a simple way by using seqkit.

$ seqkit sliding -s 1 -W 8 seqs.fa
>albumin_sliding:1-8
MKWVTFIS
>albumin_sliding:2-9
KWVTFISL
...
>albumin_sliding:13-20
FSSAYSRG

ADD COMMENT • link 6.2 years ago by shenwei356 8.4k

0

Entering edit mode

OK, I will definitely try. Thanks a lot !!!!!

ADD REPLY • link 6.2 years ago by arraychip ▴ 30

0

Entering edit mode

This is the best solution +1

ADD REPLY • link 5.7 years ago by scchess ▴ 640

score 4 · Accepted Answer · 2018-02-08

4

Entering edit mode

6.2 years ago

Sej Modha 5.3k

You requirement is to generate simple kmers from sequences. BioPython solution. Please change last print statement to: print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))

#!/usr/bin/env python3

from Bio import SeqIO

    myfile=SeqIO.parse('test.fa','fasta')
    for record in myfile:
        sequence=record.seq
        seq_len=len(sequence)
        #define the kmer len
        kmer=8
        count=0
        for seq in list(range(seq_len-(kmer-1))):
            count=count+1
            my_kmer=(sequence[seq:seq+kmer])
            print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))

ADD COMMENT • link 6.2 years ago by Sej Modha 5.3k

0

Entering edit mode

I know a little bit of R. But maybe this is a good chance to study (Bio)Python. I will definitely try, once I study a little bit of BioPython. Thanks a lot

ADD REPLY • link 6.2 years ago by arraychip ▴ 30

0

Entering edit mode

Please accept the answers to indicate the question is resolved.

ADD REPLY • link 6.2 years ago by Sej Modha 5.3k

score 2 · Accepted Answer · 2018-02-08

2

Entering edit mode

6.2 years ago

Pierre Lindenbaum 161k

Using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'final int N=8;stream().filter(F->F.length()>=N).forEach(F->IntStream.range(0,F.length()-N).forEach(L->{out.println(">"+F.getName()+"."+(L+1)+"\n"+F.subSequence(L,L+N));}));' input.fasta

ADD COMMENT • link 6.2 years ago by Pierre Lindenbaum 161k

2

Entering edit mode

I was hoping to see a awk one liner :-(

ADD REPLY • link 6.2 years ago by GenoMax 141k