Biostar Beta. Not for public use.
How To Extract Specific Coordinates From Multifasta File
6
Entering edit mode
6.5 years ago
Florianino • 70

Hi, I have a genome file in multifasta format

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT

...

and I would like to extract specific coordinates from specific sequences. Coordinates are stored in a three columns file table

chr1 2 4
chr2 4 6
chr2 3 6
...

Ideally the output would compile the extracted sequences in a single file indicating the chromosome and coordinates in the header:

>chr1_2_4
TAT
>chr2_4_6
CGT
...

Could anyone help me with this?

Thanks a lot in advance Florianino

ADD COMMENTlink
8
Entering edit mode
19 months ago
United States

Here is my solution. Please have the _seqinr_ package installed and don't forget to chmod +x

#!/usr/bin/env Rscript

library(seqinr)

seqs <- read.fasta("a001.fasta")
locs <- read.delim("a001.coords", header=F, sep=" ") # Tabs? Put sep="\t"
outf <- file("a001-results.fasta", 'w')

doitall <- function(x) {
  seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- seqs[[seq_id]]
  seqv   <- seq[start:stop]
  header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
  cat(file=outf, header)
  cat(file=outf, toupper(paste(sep="", collapse="", seqv)), "\n")
}
apply(locs, 1, doitall)
close(outf)

==> a001.coords <==

chr1 2 4
chr2 4 6
chr2 3 6

==> a001.fasta <==

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT

==> a001-results.fasta <==

>chr1_2_4
TAT 
>chr2_4_6
CGT 
>chr2_3_6
GCGT
ADD COMMENTlink
0
Entering edit mode

Hey @Aleksandr, how should the fasta headers should be? Exactly what the 'a001.coords' col1 says, right?

ADD REPLYlink
6
Entering edit mode
2.3 years ago
United States

If you want to work with a plain text FASTA file, the fastaFromBed program in BEDTools will do this for you. The two inputs are your genome file in FASTA format and the coordinates in BED format. I would suggest using the unreleased version in the repository, as it uses Erik Garrison's nice FastaHack library for speed. If you already heave a ".fai" file from samtools, it will use that. Otherwise, it will create an index the first time, and reuse it thereafter.

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa
ADD COMMENTlink
2
Entering edit mode
17 months ago
France/Nantes/Institut du Thorax - INSE…

you could try to use:

See also: http://biostar.stackexchange.com/questions/979

ADD COMMENTlink
1
Entering edit mode
6.5 years ago
Florianino • 30

Hi thanks a lot guys, I'm very new in the field, I really appreciate your help, bla bla bla

I have installed bedtool and tried fastafromBED but it looks like when I ask for positions 1 to 25, it gives me 2 to 25 instead in the output. How come?

For the other solution, I had to install R. I think there may a problem with the script because I use tab delimited file for coords and there are only three fields (BED format).

Cheers

Florianino

ADD COMMENTlink
1
Entering edit mode

Hi Florianino. Welcome to Biostar. I'm glad you installed R - it's the front-line tool in Bioinformatics. For tab delimited cords files, simply change sep=" " to sep="\t" on the read.delim line.

ADD REPLYlink
0
Entering edit mode

BED format uses zero-based, half-open coordinates, so the first 25 bases of a sequence are in the range 0-25 (those bases being numbered 0 to 24).

ADD REPLYlink
0
Entering edit mode

This would be better posed as a new question.

ADD REPLYlink
0
Entering edit mode

ok thanks, well I guess that's obvious for everyone.. BEDtools looks very handy! Florianino

ADD REPLYlink
0
Entering edit mode

@Florianino - I don't think this is obvious at all, see the following question inspired by your post: http://biostar.stackexchange.com/questions/6394/what-are-the-advantages-disadvantages-of-one-based-vs-zero-based-genome-coordina

ADD REPLYlink
0
Entering edit mode
6.5 years ago
Florianino • 30

Ok, I'll post the BED issue right now as new question. Thanks Aleksandr, i'll try that and let you know. Thank you every one. Things are much easier when people are connected Cheers

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1