Question

Enriched Regions Search Program

6

Entering edit mode

13.9 years ago

Noyk ▴ 70

Hi all, I need to find, in a set of fasta sequences, sequence that has a region of 30 or more Gln and Asn residues per 80-residue window (Gln/Asn-Rich Tracts). Any programs I can use out there? There is a mention about patmatch at this forum, can it be adapted to my purpose?

enrichment sequence programming • 3.2k views

ADD COMMENT • link updated 13.9 years ago by Neilfws 49k • written 13.9 years ago by Noyk ▴ 70

3

Entering edit mode

you are not looking for a motif, you are looking for enrichment in regions of a sequence.

ADD REPLY • link 13.9 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

@noyk Care to change the title according to @giovanni's suggestion? Great question :)

ADD REPLY • link 13.9 years ago by Eric Normandeau 11k

0

Entering edit mode

thanks.. will try out your suggestions

ADD REPLY • link 13.9 years ago by Noyk ▴ 70

0

Entering edit mode

remember to select the answer that you want to accept, to thank its author.

ADD REPLY • link 13.9 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-05-14

8

Entering edit mode

13.9 years ago

Neilfws 49k

Just for completeness, a BioRuby solution to count Q + N in 80-residue moving window.

require 'rubygems'
require 'bio'

Bio::FlatFile.foreach("myseqs.fa") do |sequence|
  sequence.seq.split("").each_cons(80) do |window|
    puts window.count("Q") + window.count("N")
  end
end

Here, each sequence is split into an array and the each_cons() array method provides the moving window - it returns nil if the sequence length is < 80.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.9 years ago by Neilfws 49k

1

Entering edit mode

neat! I love looking at small code snippets

ADD REPLY • link 13.9 years ago by Istvan Albert 100k

1

Entering edit mode

Cool. I dont use much ruby, and had never heard of each_cons. a decent summary: http://www.themomorohoax.com/2009/01/29/ruby-each-cons

Isn't that going to print the count for every possible 80 bp window even if it's zero?

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 13.9 years ago by brentp 24k

0

Entering edit mode

Yes, this snippet would print the count for each window. You'd need to add a conditional (just another couple of lines), to print only if at least one of the window counts is >= 30.

ADD REPLY • link 13.9 years ago by Neilfws 49k

Ram · Answer 2 · 2010-05-13

If you use numpy, it's easy and fast to take an 80bp window using convolve

The script below assumes you're giving it a protein fasta and prints out sequences that are "enriched", but if you need the exact position where the enrichment occurs within each sequence, you'll need to add a couple lines (lookup numpy.where). runs in a few seconds for a 4.5MB file.

from pyfasta import Fasta
import numpy as np

fasta = 'orf_trans.fasta'
window = 80
cutoff = 30

f = Fasta(fasta)
kern = np.ones((window,))

enriched = 0
for k in f.keys():
    seq = np.array(f[k], dtype='c')
    # get a boolean array with True where AA is one of 'Q', 'N'
    a = (seq == 'N') | (seq == 'Q')

    # take an 80bp moving window.
    mw = np.convolve(a, kern, mode='same')
    # if it doesn't have any windows > 30, skip it.
    if mw.max() < cutoff: continue
    enriched += 1
    # can also add something here to find the exact loc of the enrichement(s)
    print k.split()[0]

print "\n%s out of %i sequences are enriched" % (enriched , len(f.keys()))

Ram · Answer 3 · 2010-05-13

4

Entering edit mode

13.9 years ago

Giovanni M Dall'Olio 28k

Well, a very didactic python solution would be:

import re
seq = 'ACGGTCGIWYDEAJCFSHDKANNNNNQNQNQNQNQNDGASDGADJHDAGJDAQNNAQHKJHSHDDAAHHQNNQNQ'
score_by_position = {}
gln_or_asn_matcher = re.compile('[QN]')
step = 10 # replace it with 80 on the real seq
cutoff = 5 # replace it with 30

# note that this ignores all windows smaller than $step at the left and right of the sequence    
for pos in range(0, len(seq)-step):
    score = len(gln_or_asn_matcher.findall(seq[pos:pos+step]))
    score_by_position[pos] = score
    if score > cutoff:
        print 'possible enriched region at position %s; %s' % (pos, seq[pos:pos+step])

# requires matplotlib
from matplotlib.pylab import bar, show
bar(score_by_position.keys(), score_by_position.values())
xticks(range(len(seq)), seq)
show()

of course, there it should be other tools that do this and in a more secure way. you should test this script with other values before using it.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.9 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Just a minor correction:

from matplotlib.pylab import bar, show, xticks

It is generally not good to import individual functions from a module.

You can do:

import matplotlib.pyplot as plt
plt.bar(score_by_position.keys(), score_by_position.values())
plt.xticks(range(len(seq)), seq)
plt.show()

ADD REPLY • link updated 4.7 years ago by Ram 43k • written 13.9 years ago by Zhaorong ★ 1.4k

0

Entering edit mode

I wonder whether getting all the match positions into a list first (for the whole sequence), then looking at 80 bp windows across these indices would be faster on large scale (handily these are now in sorted order), that would avoid repeatedly invoking the findall. I am on the road so I can't try this out, will see if I get to it.

ADD REPLY • link 13.9 years ago by Istvan Albert 100k

Ram · Answer 4 · 2010-05-14

Here is my java solution:

File 'SearchGlnAsn.java':

import java.util.LinkedList;

public class SearchGlnAsn
    {
    public static void main(String[] args) throws Exception
        {
        int c;
        int pos=0;
        int count_N_Q=0;
        LinkedList&lt;Boolean&gt; window=new LinkedList&lt;Boolean&gt;();
        StringBuilder name=new StringBuilder();
        while((c=System.in.read())!=-1)
            {
            if(c=='&gt;')
                {
                name.setLength(0);
                while((c=System.in.read())!='\n') name.append((char)c);
                pos=0;
                count_N_Q=0;
                window.clear();
                }
            else if(!Character.isWhitespace(c))
                {
                pos++;
                boolean isAA=(c=='N' || c=='Q' || c=='n' || c=='q');
                if(isAA) count_N_Q++;
                window.add(isAA);
                if(window.size()==80)
                    {
                    if(count_N_Q&gt;30)
                        {
                        System.out.println(name+"\t"+count_N_Q+"\t"+(pos-80));
                        }
                    if(window.remove()) count_N_Q--;
                    }

                }
            }

        }

    }

Compilation:

javac SearchGlnAsn.java

Execution:

curl -s ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz |\
gunzip -c |\
java SearchGlnAsn |\
awk -F '    ' '{if($2>75) print $0; }'

sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1    79  0
sp|Q54UG3|Y1095_DICDI UPF0746 protein DDB_G0281095 OS=Dictyostelium discoideum GN=DDB_G0281095 PE=3 SV=1    76  415

 curl "http://www.uniprot.org/uniprot/Q156A1.fasta"
>sp|Q156A1|ATX8_HUMAN Ataxin-8 OS=Homo sapiens GN=ATXN8 PE=1 SV=1
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
QQQQQQQQQQQQQQQQQQQQ