Biostar Beta. Not for public use.
Split DNA sequence into dimers
15 months ago
erickfqqa • 20

I'm trying to split a list of DNA k-mers into dimers. To split them into single nucleotides I've been using this tidyverse (stringr) function:

library(stringr)
kmer <- "ATTCCCGG"
ntd_kmr <- str_split_fixed(kmer, "", 8)

and the output is:

A,T,T,C,C,C,G,G

I would like to split the k-mer into overlapping dimers, so that the output looks like this:

AT,TT,TC,CC,CC,CG,GG

I know that the seqinr package has a function that does this, but I don't know how to do it with overlapping dimers.

sequence R • 257 views

You can do it on the command line:

$ cat test.txt
atgccg

$ sed -e 's/../&, /g' test.txt | sed -r 's/,\s$//'
at, gc, cg

Almost... expected output for your example input should be at, tg, gc, cc, cg.


With awk, you can get it:

$ cat test.txt
ATTCCCGG

$ awk '{for (i=1; i<length($0); i++) printf "%s%s", substr($0,i,2), (i<length($0)-1 ? "," : "\n")}' test.txt
AT,TT,TC,CC,CC,CG,GG
11 months ago
JC 7.9k
Mexico

I think this is R, so you can simply do:

> kmer <- "ATTCCCGG" 
> x <- vector()
> for(n in seq(1, nchar(kmer) - 1, 1)) x <- c(x, substr(kmer, n, n+1) )
> x
[1] "AT" "TT" "TC" "CC" "CC" "CG" "GG"
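
If you already have the single nucleotides, as in the question, the dimers can also be built by pasting each nucleotide to its right-hand neighbour. This is only a base R sketch of that idea; the variable names nt and dimers are just for illustration.

```r
kmer <- "ATTCCCGG"

# Split into single nucleotides (base R equivalent of the str_split_fixed call)
nt <- strsplit(kmer, "")[[1]]

# Pair every nucleotide except the last with the one that follows it
dimers <- paste0(head(nt, -1), tail(nt, -1))
dimers
```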

Avoid growing objects in a loop; preallocate the result with the length argument of vector():

x <- vector(mode = "character", length = 7L)
# Or use character() which is just a wrapper for internal vector function
# x <- character(length = 7L)

for(n in seq(nchar(kmer) - 1)) x[ n ] <- substr(kmer, n, n + 1)
x
# [1] "AT" "TT" "TC" "CC" "CC" "CG" "GG"
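
The preallocated loop can be wrapped so the length is not hard-coded. This is a rough sketch, and dimers_prealloc is a made-up name, not a function from any package. It uses seq_len() rather than seq(), because seq(0) counts down as 1, 0 and would run the loop for a one-letter string.

```r
# Hypothetical helper: preallocated loop over overlapping dimers
dimers_prealloc <- function(kmer) {
  # Number of overlapping dimers; never negative, even for an empty string
  n_dimers <- max(nchar(kmer) - 1L, 0L)
  x <- character(n_dimers)
  # seq_len(0) is empty, so short strings skip the loop entirely
  for (n in seq_len(n_dimers)) x[n] <- substr(kmer, n, n + 1L)
  x
}

dimers_prealloc("ATTCCCGG")
dimers_prealloc("A")
```

The second call returns an empty character vector instead of partial strings.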

Yes, it is more efficient, but you need to know the string length in advance to preallocate the vector.


We already know it: nchar(kmer) - 1.

3 months ago
zx8754 7.5k
London

We can use substring(), which takes vectors of start and end positions.

substring(kmer, first = 1:(nchar(kmer) - 1), last = 2:nchar(kmer))
# [1] "AT" "TT" "TC" "CC" "CC" "CG" "GG"
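
Since the question mentions a list of k-mers, the same substring() call could be generalised roughly as below. Treat it as a sketch: overlapping_kmers and its k argument are hypothetical names introduced here, and the second sequence is just the example from the command-line answer.

```r
# Hypothetical helper: all overlapping substrings of width k from one sequence
overlapping_kmers <- function(seq, k = 2L) {
  n <- nchar(seq)
  # A sequence shorter than k has no substrings of width k
  if (n < k) return(character(0))
  starts <- seq_len(n - k + 1L)
  substring(seq, first = starts, last = starts + k - 1L)
}

# Apply it to several k-mers at once; the result is a named list
kmers <- c("ATTCCCGG", "ATGCCG")
dimers <- lapply(kmers, overlapping_kmers, k = 2L)
names(dimers) <- kmers
dimers
```

Setting k = 3L would give overlapping trimers in the same way.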