Question

Filter DNA sequences of defined length for certain bases at exact positions

6.7 years ago

philipp.rathert • 0

Dear all,

I was wondering if there is a way to filter DNA sequences of a defined length (let's say about 30 bp) for the presence of a specific base at a defined position.

Something like: Give me only the sequences which contain a G at position 5 and an A at position 20.

Maybe this tool is so trivial that it was never implemented in Galaxy, or I am not searching with the right keywords.

Does anybody have an idea?

Thank you very much,

Phil

sequence • 1.3k views

What is an input file? FASTA?

6.7 years ago by Paul ★ 1.5k

Dear all, thank you very much.

The input file would be a simple FASTA, and the length of the sequences is always the same.

With my very limited understanding of programming (I have never done this before), I will try to get this working on a Windows PC...

Cheers,

Phil

6.7 years ago by philipp.rathert • 0

If you can, please mark an answer as accepted or up-vote any of the solutions.

6.7 years ago by Paul ★ 1.5k

philipp.rathert: Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

cpad0112's solution may be the only one that will work natively on Windows. The other two solutions will require Python (@st.ph.n) and a virtual Unix environment (@Pierre).

6.7 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-07-27

6.7 years ago

Pierre Lindenbaum 161k

Linearize the FASTA (one record per line), then filter with awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '(substr($2,5,1)=="G") && (substr($2,20,1)=="A")' |\
tr "\t" "\n"
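
For anyone without awk at hand (for example on Windows), the same linearize-then-filter idea can be sketched in Python. This is only a minimal sketch, not a translation endorsed by the author of the awk one-liner: `read_fasta` and `has_bases` are hypothetical helper names, and the demo records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_bases(seq, g_pos=5, a_pos=20):
    """True if seq has G at g_pos and A at a_pos (1-based, like awk's substr)."""
    return (len(seq) >= max(g_pos, a_pos)
            and seq[g_pos - 1] == "G"
            and seq[a_pos - 1] == "A")


# Hypothetical demo records; read1 is wrapped over two lines.
demo = [
    ">read1",
    "TTTTG" + "T" * 10,
    "TTTTA" + "T" * 10,
    ">read2",
    "TTTTC" + "T" * 14 + "A" + "T" * 10,
]
for header, seq in read_fasta(demo):
    if has_bases(seq):
        print(header)
        print(seq)
```

To run it on a real file, pass an open file handle instead of `demo`, for example `read_fasta(open("input.fasta"))`.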

score 0 · Answer 2 · 2017-07-27

This assumes single-line FASTA input; otherwise, linearize it first with Pierre's awk statement. Each sequence must have a G at position 5 and an A at position 20, and be 30 bp long. If you want to know how many sequences have G+A, G only, or A only, you should find it easy to edit this code to do so.

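#!/usr/bin/env python3
import sys

# Expects the path to a single-line FASTA file as the first argument.
with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith('>'):
            h = line.strip()
            # The sequence is on the line right after the header.
            s = next(f, '').strip()
            # Positions 5 and 20 are 1-based; Python indexes from 0.
            if len(s) == 30 and s[4] == 'G' and s[19] == 'A':
                print(h)
                print(s)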
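
The counting variant mentioned above (G+A, G only, A only) could look roughly like this. It is a sketch under assumptions: `classify` is a hypothetical helper name, the test sequences are invented, and positions are treated as 1-based as in the question.

```python
from collections import Counter


def classify(seq, g_pos=5, a_pos=20):
    """Label a sequence by which of the two bases it carries (1-based positions)."""
    has_g = len(seq) >= g_pos and seq[g_pos - 1] == "G"
    has_a = len(seq) >= a_pos and seq[a_pos - 1] == "A"
    if has_g and has_a:
        return "G+A"
    if has_g:
        return "G only"
    if has_a:
        return "A only"
    return "neither"


# Hypothetical 30 bp sequences for illustration.
seqs = [
    "TTTTG" + "T" * 14 + "A" + "T" * 10,  # G at 5, A at 20
    "TTTTG" + "T" * 25,                   # G at 5 only
    "T" * 19 + "A" + "T" * 10,            # A at 20 only
    "T" * 30,                             # neither
]
counts = Counter(classify(s) for s in seqs)
for label in ("G+A", "G only", "A only", "neither"):
    print(label, counts[label])
```

In the script above, the same `classify` call would replace the `if` test, with the counter updated once per record.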

score 0 · Answer 3 · 2017-07-28

6.7 years ago

cpad0112 21k

My FASTA:

>seq1
ACGTACTGCACC
>seq2
CTGATTTAGTCTATATGTATACT
>seq3
ATGACTATATCGTACGATCTAGTCCC

Requirements:

Filter sequences by length (12 bases in this example) and keep only those with a G at the 3rd position and a C at the 12th position. Note that -M 12 sets only the maximum length; add -m 12 to require exactly 12 bases.

Command:

$ seqkit seq -M 12  test.fa | seqkit grep -s  -R 3:3 -p G | seqkit grep -s -R 12:12 -p C

output:

$ seqkit seq -M 12  test.fa | seqkit grep -s  -R 3:3 -p G | seqkit grep -s -R 12:12 -p C

 >seq1
 ACGTACTGCACC
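
If seqkit is not installed, the same check can be sketched in plain Python to confirm the result for the example records. This is an illustration under assumptions, not seqkit's implementation: `passes` is a hypothetical helper, it requires the exact length (not just a maximum), and positions are 1-based.

```python
def passes(seq, length, required):
    """True if seq has exactly `length` bases and each 1-based position
    in `required` holds the given base."""
    return len(seq) == length and all(
        pos <= len(seq) and seq[pos - 1] == base
        for pos, base in required.items()
    )


# Example records copied from the FASTA shown in this answer.
records = {
    ">seq1": "ACGTACTGCACC",
    ">seq2": "CTGATTTAGTCTATATGTATACT",
    ">seq3": "ATGACTATATCGTACGATCTAGTCCC",
}
kept = [h for h, s in records.items() if passes(s, 12, {3: "G", 12: "C"})]
print(kept)  # ['>seq1']
```

For the original question, the call would be `passes(seq, 30, {5: "G", 20: "A"})`.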

cpad0112: Please include links to download seqkit in future answers. Not everyone may be aware of this tool.

6.7 years ago by GenoMax 141k