split fasta of a protein sequence into consecutive n amino acids
3
2
Entering edit mode
6.2 years ago
arraychip ▴ 30

let's say I have a fasta of a protein sequence

> albumin

MKWVTFISLL FLFSSAYSRG ... ... ...

I want to split the sequence into all possible consecutive 8 amino acids (only in 1 direction, amino -> carboxyl) (And no looping(I don't know if it is the right expression), say, GMKWVTFIS)

I need

> fasta.albumin1

MKWVTFIS

>fasta.albumin2

KWVTFISL

> fasta.albumin3

WVTFISLLF

...


> fasta.albumin13

FSSAYSRG

And, I want to do this for all known human protein sequences.

How would I do it??? I need the result as a fasta or fasta files. And IDs for resulting 8-mer seuqeunces need to be unique.

fasta • 3.4k views
ADD COMMENT
1
Entering edit mode

Have you tried to do this yourself? Please post your code you have already tried.

ADD REPLY
1
Entering edit mode

I am a biologist. I can google to search for news, but can't, and don't know any scripting. I was even struggling with my posting questions. Sorry

ADD REPLY
0
Entering edit mode

That is fine. This forum is there to help with that. I have formatted your post yet again. Please do not change the formatting back to plain text.

ADD REPLY
1
Entering edit mode

You say 5 AA in title but 8 in the body of post. Which is it?

There are ~20K+ validated human proteins at UniProt so results file is going to be gigantic.

Note: Why do you keep changing back the formatting of the data in your post to plain text?

ADD REPLY
0
Entering edit mode

Yes, it's going to be a 'big data' I really want try.

ADD REPLY
0
Entering edit mode

Hey @genomax

There is some issue with the code formatter below. A '(' is missing in my code print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer)) below but looks okay when I check in the edit mode.

ADD REPLY
0
Entering edit mode

There is a known issue with python code formatting and biostars display of that code. I think the fix is to put a comment there to add the ( at proper place. Not elegant but will work for now.

Edit: Best solution is to put your code in a gist and then link that in your post. Biostars then looks the code up via gist and formats it correctly.

ADD REPLY
0
Entering edit mode

I see! I will opt for the gist option next time. Thanks

ADD REPLY
0
Entering edit mode

If you know python/perl then this should be easy to do. Someone is bound to put a program up as answer shortly. @Pierre will likely have a one liner in awk.

ADD REPLY
5
Entering edit mode
6.2 years ago

There are many methods. For long terms, some programming skill is worth learning.

Here's a simple way by using seqkit.

$ seqkit sliding -s 1 -W 8 seqs.fa
>albumin_sliding:1-8
MKWVTFIS
>albumin_sliding:2-9
KWVTFISL
...
>albumin_sliding:13-20
FSSAYSRG
ADD COMMENT
0
Entering edit mode

OK, I will definitely try. Thanks a lot !!!!!

ADD REPLY
0
Entering edit mode

This is the best solution +1

ADD REPLY
4
Entering edit mode
6.2 years ago
Sej Modha 5.3k

You requirement is to generate simple kmers from sequences. BioPython solution. Please change last print statement to: print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))

#!/usr/bin/env python3

from Bio import SeqIO

    myfile=SeqIO.parse('test.fa','fasta')
    for record in myfile:
        sequence=record.seq
        seq_len=len(sequence)
        #define the kmer len
        kmer=8
        count=0
        for seq in list(range(seq_len-(kmer-1))):
            count=count+1
            my_kmer=(sequence[seq:seq+kmer])
            print('>'+strrecord.id)+'|kmer_'+str(count)+'\n'+str(my_kmer))
ADD COMMENT
0
Entering edit mode

I know a little bit of R. But maybe this is a good chance to study (Bio)Python. I will definitely try, once I study a little bit of BioPython. Thanks a lot

ADD REPLY
0
Entering edit mode

Please accept the answers to indicate the question is resolved.

ADD REPLY
2
Entering edit mode
6.2 years ago

Using bioalcidaejdk:

$ java -jar dist/bioalcidaejdk.jar -e 'final int N=8;stream().filter(F->F.length()>=N).forEach(F->IntStream.range(0,F.length()-N).forEach(L->{out.println(">"+F.getName()+"."+(L+1)+"\n"+F.subSequence(L,L+N));}));' input.fasta
ADD COMMENT
2
Entering edit mode

I was hoping to see a awk one liner :-(

ADD REPLY

Login before adding your answer.

Traffic: 2592 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6