Biostar Beta. Not for public use.
Multifasta to Singlefasta (replace headers with 'NNNNNNNNNN', then join the multiple contigs)
19 months ago

Hi to all,

I'm trying to use a script to find IS in my genomes, but the script only runs on single-FASTA files (i.e., complete genomes). Some of my genome files are multi-FASTA files. Is there a Python (or Perl) script that removes all the contig headers in a file (e.g. ">contig-header") except the first one, replacing each removed header with something like "NNNNNNNNNN"? That way I could still map where the headers used to be, and I could run the IS-finding script on the resulting multifasta2singlefasta files.

Changing data drastically so it would work with a script seems like a really bad idea to me.

Why not split the multi-fasta files into individual ones and run the tool on those files instead of doing what you are proposing?
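A minimal sketch of that splitting step, assuming plain-text FASTA input. The function name `split_fasta`, the `prefix` argument, and the output file naming are hypothetical choices, not part of any existing tool. Each output file name carries the record's position and the first word of its header, so a result can be traced back to its contig even with several hundred files.

```python
import os
import re


def split_fasta(text, out_dir, prefix="genome"):
    """Write each FASTA record in `text` to its own file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)

    # Group lines into records; a new record starts at every ">" line.
    records, current = [], None
    for line in text.splitlines():
        if line.startswith(">"):
            current = [line]
            records.append(current)
        elif current is not None and line.strip():
            current.append(line.strip())

    paths = []
    for index, record in enumerate(records, start=1):
        # Use the first word of the header, made filesystem-safe, in the name.
        words = record[0][1:].split()
        name = re.sub(r"[^A-Za-z0-9._-]", "_", words[0]) if words else "unnamed"
        path = os.path.join(out_dir, f"{prefix}_{index:04d}_{name}.fasta")
        with open(path, "w") as handle:
            handle.write("\n".join(record) + "\n")
        paths.append(path)
    return paths
```

The input file is only read, never modified; all output goes to `out_dir`.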

Because some genomes have around 300 contigs. Splitting them all doesn't seem practical to me, and I would probably lose track of the contigs belonging to each file. The IS script gives me coordinates, so I can track down the IS locations later.

Then you do not need the contig headers anyway. The coordinates will still be correct. Your proposed approach would actually be more difficult, since you’d be replacing the headers with an arbitrary number of Ns (and possibly different numbers of Ns depending on how you did the substitution), and so your base indices would be completely meaningless.

It didn't need to be an arbitrary number of Ns.
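With a fixed-length spacer, a position in the joined sequence can be mapped back to its contig as long as the contig lengths are recorded in order. The sketch below assumes 0-based positions and a 10-base spacer; the names `contig_offsets` and `locate` are hypothetical, and whether the IS-finding script reports 0-based or 1-based coordinates would need to be checked before using anything like this.

```python
from bisect import bisect_right


def contig_offsets(lengths, spacer_len=10):
    """Return the 0-based start of each contig within the joined sequence."""
    starts, pos = [], 0
    for length in lengths:
        starts.append(pos)
        pos += length + spacer_len
    return starts


def locate(position, names, lengths, spacer_len=10):
    """Map a 0-based position in the joined sequence to (contig, offset).

    Returns None when the position falls inside a spacer or outside the
    joined sequence.
    """
    starts = contig_offsets(lengths, spacer_len)
    i = bisect_right(starts, position) - 1  # last contig starting at or before position
    if i < 0:
        return None
    offset = position - starts[i]
    if offset >= lengths[i]:
        return None
    return names[i], offset


# Hypothetical example: two 16-base contigs joined with a 10-N spacer.
names = ["contig1", "contig2"]
lengths = [16, 16]
print(locate(30, names, lengths))
```

A hit that spans a spacer would map its two ends to different contigs (or to None), which is one way to flag coordinates that are artifacts of the join.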

But anyway, thank you very much, genomax, jrj.healey and Ram, for giving me clues and advice so quickly. I'll follow what you suggested.

Please search the forum for these various tasks; each one has very well documented solutions.

e.g.:

Concatenating sequences

Methods for manipulating fasta headers (one of many)

Give them a try. If you can’t make a solution work, come back and show us what you’ve tried and where you are stuck.

BTW,

An example of what I want the Python script to do:

~BEFORE_MULTIFASTA_FILE~
ATCGATCGATCGATCG

ATCGATCGATCGATCG

~AFTER_SINGLEFASTA_FILE~
ATCGATCGATCGATCGNNNNNNNNNNATCGATCGATCGATCG
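A minimal sketch of a script along these lines, in plain Python with no external libraries. It keeps the first header, joins all sequences with a fixed spacer of ten Ns, and returns the result as new text rather than touching the input. The function names `read_fasta` and `join_contigs`, the `width` parameter, and the headers `contig1`/`contig2` in the demo are assumptions made for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def join_contigs(text, spacer="N" * 10, width=60):
    """Return single-record FASTA text: first header, sequences joined by spacer."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    first_header = records[0][0]
    joined = spacer.join(seq for _, seq in records)
    # Wrap the joined sequence at a fixed line width.
    lines = [joined[i:i + width] for i in range(0, len(joined), width)]
    return ">" + first_header + "\n" + "\n".join(lines) + "\n"


# Hypothetical two-contig input matching the example above.
example = ">contig1\nATCGATCGATCGATCG\n>contig2\nATCGATCGATCGATCG\n"
print(join_contigs(example), end="")
```

Writing the returned text to a new file leaves the original multi-FASTA intact, and because the spacer length is fixed, the start of each contig in the joined sequence can be recomputed from the contig lengths.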

I’m with Ram and the others: this seems like a bad approach. Just concatenate the file normally, write the result to a separate file, and keep the original.

Concatenating draft genomes (or more usually scaffolding them) is fairly normal.

To add to jrj.healey's point, always design tools to write to stdout or an explicitly specified output file. Modifying an input file is unexpected (verging on harmful) behavior.

It's not my script (the one that finds IS in complete genomes). Someone else designed it to work the way I described. I'm just trying to find a way to take a first look at my data (the genomes that are not complete).

OK, then I'd recommend creating a new single fasta file and working on that.

What is IS?