Sub Sequence extract from single Fasta sequence
1
0
Entering edit mode
4.9 years ago
gnomicData • 0

Hi everyone,

I am trying to isolate a set of sequence ranging from 6 to 15 nucleotide in size from a single DNA sequence about 130 nt long. The goal is to isolate all sequences with size ranges starting from 6 nucleotide upto 15 nucleotide long. These need to be all seqeunces as well and not unique sequences only. For example, if my single large 130 nt DNA sequence is AATTGCGCTAGCGACTAGG........AGAGATAAGAGAGATTCTCTCA then starting from the left end, I will need AATTGC as well as ATTGCG and so on. Basically all possible 6 nucleotide sequences from the whole single 130 nt DNA sequence file. This pattern needs to be repeated for other sequence sizes as well (up to 15 nucleotides) Technically, it would be best to get the outputs in individual files (where each files will contain all possible sequences categorized by the size like 8, 9 , 10 ....up to 15). However, its also okay if the entire output is generated into a single file.

I need this code to work on R ( I am working with RStudio).

Currently I have tried splitseq but I really cannot loop with it. There i am providing a single value every time for the “frame” option and again a single value for the “word” option.

Something like:

splitseq(CDS, frame = 0, word = 6)   ## word=7, 8 ,9 etc for other nucleotide sizes extraction
[1] "TATCGT" "GCAGCA" "TCGGAC" "TAGCAT" "CGATCG" "ATTTAC" "ATCATC"

where CDS is the variable assigned with the sequence (~ 130nt)

so to get then next frame

splitseq(CDS, frame = 1 , word = 6)
[1] "ATCGTG" "CAGCAT" "CGGACT" "AGCATC" "GATCGA" "TTTACA" "TCATCT"

etc....

But this is not really automating the procedure and it will take me forever to complete the 6 nucleotide work, not to mention the rest of the nucleotide sizes upto 15 (word= 7, 8, 9,.... upto 15).

I have also come across a previous post here that talked about using a combination of DNAstrings, Views from the Biostrings package. However the problem is I need to loop through all sequences from 6 to 15 nucleotide size for all positions in this ~ 130 nt fasta sequence.

Any scripts/ suggestions will be most appreciated !

Thanks very much !!

sequence • 1.1k views
ADD COMMENT
1
Entering edit mode

If R software is not a strict requirement then use jellyfish to get these counts (http://www.genome.umd.edu/jellyfish.html ).

ADD REPLY
1
Entering edit mode

Also seqkit sliding:

for i in {6..15}; do seqkit sliding --step 1 --window ${i} test.fa > word${i}.fa; done
ADD REPLY
0
Entering edit mode

Unfortunately I needed this in R only. But always good to know about tools (jellyfish) and their potentials. Thanks @Genomax !

ADD REPLY
1
Entering edit mode
4.9 years ago
AK ★ 2.2k

Perhaps like this?

CDS <- "AATTGCGCTAGCGACTAGG"

for (w in 6:15) {
  sink(paste0("word", w, ".fa"))
  for (start in 1:(nchar(CDS) - w + 1)) {
    end <- start + w - 1
    sub_CDS <- substr(CDS, start, end)
    cat(paste0(">", w, "_", start, ":", end), "\n")
    cat(sub_CDS, "\n")
  }
  sink()
}
ADD COMMENT
0
Entering edit mode

Absolutely fantastic ! Just what I needed Thank you very much !! @SMK

ADD REPLY

Login before adding your answer.

Traffic: 2782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6