How to create a dataset from a sequence file in Python
9.8 years ago
Jason Lin • 0

I have a protein sequence file that looks like this:

>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL       -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX

The first field is the name of the sequence, the second is the actual protein sequence, and the third is an indicator that shows whether any coordinates are missing. In this case, notice the two "X" characters at the end. They mean that the last two residues of the sequence, "NL" in this case, have missing coordinates.

Using Python, I would like to generate a table with the following columns:

  1. name of the sequence
  2. total number of missing coordinates (which is the number of X)
  3. the range of these missing coordinates (which is the range of the position of those X)
  4. the length of the sequence
  5. the actual sequence

So the final result should look like this:

>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

And my code looks like this so far:

total_seq = []
with open('sample.txt') as lines:
    for l in lines:
        split_list = l.split()

        # Assign the list number
        header = split_list[0]                                # 1
        seq = split_list[1]                                   # 5
        disorder = split_list[2]

        # count sequence length and total residue of missing coordinates
        sequence_length = len(seq)                            # 4

        # count the 'X' characters; the counter has to start at 0 before the loop
        counts = 0
        for x in disorder:
            if x == 'X':
                counts = counts + 1

        total_seq.append([header, seq, str(counts)])   # obviously I haven't finished coding 2 & 3

with open('new_sample.txt', 'a') as f:
    for lol in total_seq:
        f.write('\n'.join(lol))

I'm new to Python. Could anyone help, please? Thank you so much!

python • 4.9k views

It helped. But I still don't understand how to solve numbers 2 and 3 in my goal: the total number of missing coordinates and the range of those missing coordinates.

9.8 years ago
Zhaorong ★ 1.4k

For question 2):

disorder = '---XX--XXX--'
print(disorder.count('X'))

This uses the string's count() method, which returns the number of occurrences of 'X'.

For question 3):

from itertools import groupby, count

# 1-based positions of every 'X' in the disorder string defined above
indices = [i for i, x in enumerate(disorder, 1) if x == 'X']

def as_range(iterable): # not sure how to do this part elegantly
    l = list(iterable)
    if len(l) > 1:
        return '{0}-{1}'.format(l[0], l[-1])
    else:
        return '{0}'.format(l[0])

# consecutive positions share the same (position - running counter) key
print(','.join(as_range(g) for _, g in groupby(indices, key=lambda n, c=count(): n - next(c))))

This is more complicated. You may want to read these:
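
For what it's worth, here is a rough sketch of how the two pieces could be combined into the five-column table from the question. It is not the only way to do it: it swaps the groupby trick for a regular expression (re.finditer over runs of 'X'), which some may find easier to read. The function names missing_ranges, summarize and convert are made up for this example, the 'NA' placeholder for records without any 'X' is my own assumption, and the record in the usage lines is a toy one, not real data.

```python
import re

def missing_ranges(disorder):
    """Return 1-based, inclusive ranges of consecutive 'X' characters."""
    parts = []
    for m in re.finditer(r'X+', disorder):
        start, end = m.start() + 1, m.end()  # 0-based half-open -> 1-based inclusive
        parts.append(str(start) if start == end else '{0}-{1}'.format(start, end))
    # 'NA' is an assumed placeholder for records with no missing coordinates
    return ','.join(parts) or 'NA'

def summarize(line):
    """Split one 'name sequence disorder' line into the five requested columns."""
    header, seq, disorder = line.split()
    return [header, str(disorder.count('X')), missing_ranges(disorder),
            str(len(seq)), seq]

def convert(in_path, out_path):
    """Write one space-separated summary line per non-empty input line."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if line.strip():
                dst.write(' '.join(summarize(line)) + '\n')

# Toy record (hypothetical name and sequence) to show the column order
print(' '.join(summarize('>TEST:A MNIFEM ----XX')))
print(missing_ranges('---XX--XXX--'))
```

With the file names from the question, the call would be convert('sample.txt', 'new_sample.txt'). Note that it opens the output in 'w' mode, so rerunning it overwrites the file instead of appending to it.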
