Find conserved blocks and extract subalignment with Python
0
0
Entering edit mode
7.0 years ago
Kevin D ▴ 30

Hi everyone,

After aligning DNA sequences, I would like to extract a sub alignment as follow:

  • scroll the alignment and find the first and last X-bp conserved blocks where X is a number given by the user.
  • extract the sub alignment surronded by the two conserved blocks.

I tried this with python wich uses a sliding window to get chunks and analyze each of them:

#! /bin/python
# coding: utf-8

import os, sys, re, glob

# Perform alignment
#--------------------------------------------------------------------------
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile = "opuntia2.fasta")
stdout, stderr = cline()
# Print alignment
from Bio import AlignIO
align = AlignIO.read("opuntia2.aln", "clustal")
print(align)
# Calculate alignment length in bp
n = int(align.get_alignment_length())
# Define sliding window parameters
winSize = 100
step = 1

# Define a sliding window
#---------------------------------------------------------------------------
def slidingWindow(align,winSize,step):

    # Pre-compute number of chunks to emit
    numOfChunks = ((n-winSize)/step)+1

    # Do the work
    for i in range(0,numOfChunks*step,step):
        start = i
        end = i+winSize 
        yield align[:,i:i+winSize]
        #yield start, end #if I uncomment this lines problems appear because this function's output is a multiple seq aln and then I cannot manipulate this object easily

# Define a function that check if a column has common nucleotides or not
#---------------------------------------------------------------------------
def all_same(items):
    return all(x == items[0] for x in items)


# Real process starts here
#--------------------------------------------------------------------------
# List all possible chunks by applying the sliding window on aln
chunks = slidingWindow(align,winSize,step)

for chunk in chunks:
    print chunk
    # Loop on every chunk to see if it is conserved or not
    for p in range(winSize):
        column = chunk[:, p]
        tester  = all_same(column)
        print tester

Then, I wanted to check if the tester variable is True for all position of each chunk. If true, then it means all the chunks is conserved and I could save its position in a record file and continue to examine each chunk. At the end, I could select the first and last line of the record file and extract the corresponding subalignment from the original one.

I clearly see that my solution is very awkward and I would be pleased if anyone could had new (simpler) perspectives!

multiple-sequence-alignment python • 3.4k views
ADD COMMENT

Login before adding your answer.

Traffic: 2787 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6