Insert a pattern in a sequence
4
0
Entering edit mode
7.0 years ago
jaqy ▴ 20

Hi,

I have 200 sequences in fasta format (1000 bp) and a list of 10 patterns (20 bp) I try to replace 20 bp of each sequence by one pattern from the list (the idea is to insert a pattern in each sequence and the output sequences should be 1000 bp). the choice of the pattern to insert and the position should be do randomly ? can you help me please ?

Thank you in advance

UPDATED with new information:

Hello all, thank you very much for your reply. i have a system and i want to test this system to detect 20 patterns. so, i want to have 200 sequences (ADN)whith a same size (1000 bp). in each sequence, we have one pattern from the set in a randomly position. first, i have prepare 200 random sequences but my objective is to insert patterns in sequences.

i give you an exemple i have 200 sequences :

seq1 ATCGATCGAT........GCTGCATGCAT (1000 bp)
seq2 ATCGGTCGTAT........GCTGGATCGCAT (1000 bp)
:
:
:
seq199 ATCGCATGAT........GCTGCATGCATGT (1000 bp)
seq200 ATCGATCGAT........GCTGCATGCATACG (1000 bp)

and a set of 20 patterns

ATGTC
ATCGT
ATGCT
...

and i want to insert 1 patterns (randomly choice) in each sequence in a randomly position. for exemple

seq1 ATCGATCGAT.....ATGTC...GCTGCATGCAT (1000 bp)
seq2 ATCGGTCGTAT....ATCGT....GCTGGATCGCAT (1000 bp)
:
:
:
seq199 ATCGCATGAT.....ATGCT...GCTGCATGCATGT (1000 bp)
seq200 ATCGATCGAT....ATGTC....GCTGCATGCATACG (1000 bp)

We can reason otherwise, prepare 200 sequence (with 995 bp), and insert patterns randomly in sequences (without substituting) to obtain in tne end 200 sequences 1000bp (because patterns have the same size 5 bp, so 995+5=1000bp).

thank you

sequence mutation random • 2.9k views
ADD COMMENT
1
Entering edit mode

Your question contains insufficient information to get an answer. Including an example is helpful too!

ADD REPLY
0
Entering edit mode

Writing a script is the best and easiest way.

for every sequence
    generate a random integer between 1 to 981 as the insert position, LOC
    randomly choose a pattern
    concatenate upstream sequence (1 to LOC-1), pattern, and down-stream seq (LOC+20 to 1000)

Substitution operation is not needed.

I post a way down there using seqkit and shell, it's dirty but works.

ADD REPLY
1
Entering edit mode
7.0 years ago
BioinfGuru ★ 1.7k

First convert the fasta from a multiline fasta to a single line fasta:

1) Multiline Fasta To Single Line Fasta

2) http://bioinformatics.cvr.ac.uk/blog/short-command-lines-for-manipulation-fastq-and-fasta-sequence-files/

Then take a look at "sed"

sed -i 's/day/night/g' file.fa

find all occurrences of "day" in file.fa and replace them with "night"

s means substitude

g means global - sed also supports regular expressions

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

ADD COMMENT
1
Entering edit mode

While this may work partially it would need OP to select day part manually (for 200 sequences). It also appears that the replacement may be needed/wanted only once per sequence.

jaqy : What is this trying to achieve?

ADD REPLY
2
Entering edit mode

well we cant do his/her homework completely for him/her now can we ;)

ADD REPLY
2
Entering edit mode

It is a very specific need but it doesn't quite make sense. So this may be homework but then it many not be. We will have to wait to find out.

ADD REPLY
1
Entering edit mode

This would address the OP's question, but not the part about randomly inserting the pattern pairs. It might be worth implementing a slightly more 'full fat' solution in perl or python to make use of their RNG's and regex support?

ADD REPLY
1
Entering edit mode
7.0 years ago
Joe 21k

Here's a partial solution that would handle the actual substitutions. As others have said though, you need to give us more information about the constraints you have (should there be only on subsititution per sequence? Should every instance be replaced? Should one string always be replaced by another string once they've been randomly paired or should every additional occurence also be randomly switched around?)

It's good programming practice to break your problem up in to parts. Firstly, you need a method of substituting one string for another, but since those strings are going to change/be random, this is ideally suited to a variable. There are loads of ways to do this (e.g. sed 's/"$var1"/"$var2"/g')

I would follow @kennethcondon2007's advice about linearising your fasta's first, then the small awk script below replaces instances of a string in 1 file by looking up pairs of values in another (keyfile).

#!/usr/bin/awk

# Usage:
# awk -f key_renamer.awk keyfile file_to_edit

NR==FNR { pattern[NR] = $1; replacement[NR] = $2; count++; next }
{
    for (i = 1; i <= count; i++) {
        sub(pattern[i], replacement[i])
    }
    print $0
}

The keyfile should be a 2 column file with a string to find in column 1, and a string to replace it with in column 2.

You could generate the column pairs in a key file via another random number generator method. I don't have code for this myself, but something like the random library of python would surely help out (http://stackoverflow.com/questions/20034292/generate-unique-and-random-list-of-string-pairs-python)

Similarly, perl is able to generate random numbers easily. This is a little function I use to randomly sample N lines from text files so I can check the consistency of formatting etc when performing bioinfx tasks on files:

randlines(){

# scriptname.sh filename numlines
# Generates N random lines from a file ($1) where N is $2

for ((i=0;i<"$2";i++)) ; do
 perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' $1
done
}
ADD COMMENT
1
Entering edit mode
7.0 years ago

Here's a tested way using seqkit and SHELL.

for every sequence
    generate a random integer between 1 to 981 as the insert position, LOC
    randomly choose a pattern
    concatenate upstream sequence (1 to LOC-1), pattern, and down-stream seq (LOC+20 to 1000)

Step 1, splitting FASTA sequences in to files, one sequence per file.

# test data
seqkit stats seqs.fa
file     format  type  num_seqs  sum_len  min_len  avg_len  max_len
seqs.fa  FASTA   DNA        200  200,000    1,000    1,000    1,000

# split FASTA sequences in to files, one sequence per file.
seqkit split -f -s 1 seqs.fa

tree seqs.fa.split
seqs.fa.split
├── seqs.part_001.fa
├── seqs.part_002.fa
├── seqs.part_003.fa
...

Step 2, preparing pattern (20bp) file, one sequence per line

cat patterns.txt 
aaaaaaaaaaaaaaaaaaaa
cccccccccccccccccccc
gggggggggggggggggggg
tttttttttttttttttttt
nnnnnnnnnnnnnnnnnnnn

Step 3, handling every sequence file

# a function to print up-stream, pattern, and down-stream of insert point.
#
# $1, insert loction
# $2, index of pattern
# $3, fasta file, one seq per file
# $4, pattern file, one seq per line
seq_segments () {
    seqkit subseq -r 1:$(($1-1)) $3 | seqkit seq -s -w 0
    sed -n ${2}p $4
    # 20 is the max length of patterns
    seqkit subseq -r $(($1+20)):-1 $3 | seqkit seq -s -w 0
}

pf=patterns.txt
let N=$(wc -l $pf | cut -d " " -f 1)
for f in seqs.fa.split/*.fa; do
    # 980 = 1000 - 20 (max pattern lenght)  
    let i=$(( $RANDOM % 980 + 1)) # random insert position
    let n=$(( $RANDOM % $N + 1))  # random pattern
    echo -e "insert ${n}th pattern into location $i"

    # print tab-delimited format and covert back to FASTA format
    echo -e $(seqkit seq -n $f)"\t"$(seq_segments $i $n $f $pf | paste -s -d "" -) \
        | seqkit tab2fx > $f.new
done

# join all handled file
cat seqs.fa.split/*.new > seqs.new.fa

Result

seqkit stats seqs.fa seqs.new.fa 
file         format  type  num_seqs  sum_len  min_len  avg_len  max_len
seqs.fa      FASTA   DNA        200  200,000    1,000    1,000    1,000
seqs.new.fa  FASTA   DNA        200  200,000    1,000    1,000    1,000

Checking distribution of inseted locations using csvtk (v0.7.1) or later versions.

# convert patterns to FASTA format
csvtk mutate -H -t patterns.txt | seqkit tab2fx > patterns.fa

# find all inserted patterns
seqkit locate -f patterns.fa seqs.new.fa > pattern_locations.tsv

csvtk plot line -t -x start -y start --xlab "insert position" \
    --ylab "insert position" pattern_locations.tsv > loc.png

Frequencies of patterns:

csvtk freq -t -f pattern pattern_locations.tsv | csvtk -t pretty
pattern                frequency
cccccccccccccccccccc   75
aaaaaaaaaaaaaaaaaaaa   80
nnnnnnnnnnnnnnnnnnnn   90
gggggggggggggggggggg   75
tttttttttttttttttttt   80
ADD COMMENT
1
Entering edit mode

Writing a python/perl is much easier.

ADD REPLY
0
Entering edit mode

thank you for your reply but when i excute the first script, i have error ?

ADD REPLY
0
Entering edit mode

which script? please paste command and error info

ADD REPLY
0
Entering edit mode

Everything is fine. please can you give me the reference of seqkit to cite it. thank you very much

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

You will need to make sure you install seqkit - if that is the problem you're having.

ADD REPLY
0
Entering edit mode

yes i installed seqkit thank you

ADD REPLY

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6