Question

Sub Sequence extract from single Fasta sequence

0

Entering edit mode

4.9 years ago

gnomicData • 0

Hi everyone,

I am trying to isolate a set of sequence ranging from 6 to 15 nucleotide in size from a single DNA sequence about 130 nt long. The goal is to isolate all sequences with size ranges starting from 6 nucleotide upto 15 nucleotide long. These need to be all seqeunces as well and not unique sequences only. For example, if my single large 130 nt DNA sequence is AATTGCGCTAGCGACTAGG........AGAGATAAGAGAGATTCTCTCA then starting from the left end, I will need AATTGC as well as ATTGCG and so on. Basically all possible 6 nucleotide sequences from the whole single 130 nt DNA sequence file. This pattern needs to be repeated for other sequence sizes as well (up to 15 nucleotides) Technically, it would be best to get the outputs in individual files (where each files will contain all possible sequences categorized by the size like 8, 9 , 10 ....up to 15). However, its also okay if the entire output is generated into a single file.

I need this code to work on R ( I am working with RStudio).

Currently I have tried splitseq but I really cannot loop with it. There i am providing a single value every time for the “frame” option and again a single value for the “word” option.

Something like:

splitseq(CDS, frame = 0, word = 6)   ## word=7, 8 ,9 etc for other nucleotide sizes extraction
[1] "TATCGT" "GCAGCA" "TCGGAC" "TAGCAT" "CGATCG" "ATTTAC" "ATCATC"

where CDS is the variable assigned with the sequence (~ 130nt)

so to get then next frame

splitseq(CDS, frame = 1 , word = 6)
[1] "ATCGTG" "CAGCAT" "CGGACT" "AGCATC" "GATCGA" "TTTACA" "TCATCT"

etc....

But this is not really automating the procedure and it will take me forever to complete the 6 nucleotide work, not to mention the rest of the nucleotide sizes upto 15 (word= 7, 8, 9,.... upto 15).

I have also come across a previous post here that talked about using a combination of DNAstrings, Views from the Biostrings package. However the problem is I need to loop through all sequences from 6 to 15 nucleotide size for all positions in this ~ 130 nt fasta sequence.

Any scripts/ suggestions will be most appreciated !

Thanks very much !!

sequence • 1.1k views

ADD COMMENT • link 4.9 years ago by gnomicData • 0

1

Entering edit mode

If R software is not a strict requirement then use jellyfish to get these counts (http://www.genome.umd.edu/jellyfish.html ).

ADD REPLY • link 4.9 years ago by GenoMax 141k

1

Entering edit mode

Also seqkit sliding:

for i in {6..15}; do seqkit sliding --step 1 --window ${i} test.fa > word${i}.fa; done

ADD REPLY • link 4.9 years ago by AK ★ 2.2k

0

Entering edit mode

Unfortunately I needed this in R only. But always good to know about tools (jellyfish) and their potentials. Thanks @Genomax !

ADD REPLY • link 4.9 years ago by gnomicData • 0

score 1 · Answer 1 · 2019-06-03

1

Entering edit mode

4.9 years ago

AK ★ 2.2k

Perhaps like this?

CDS <- "AATTGCGCTAGCGACTAGG"

for (w in 6:15) {
  sink(paste0("word", w, ".fa"))
  for (start in 1:(nchar(CDS) - w + 1)) {
    end <- start + w - 1
    sub_CDS <- substr(CDS, start, end)
    cat(paste0(">", w, "_", start, ":", end), "\n")
    cat(sub_CDS, "\n")
  }
  sink()
}