Question: Run Prediction program on entire folder.

Hello!

I need to run a Python script on a whole proteome, which is ~3600 proteins. Each run requires two separate files: a FASTA sequence file and a profile file. Both have the same file name, based on their headers. How do I run it over two folders (one with the FASTA sequences, the other with the profile files)? Do I have to modify the script, or write a new one?

3.2 years ago enkh.tug • 0 • updated 3.2 years ago ole.tange ♦ 3.4k

Search for "bash loop files" and you'll find many hints, e.g., this one.

3.2 years ago h.mon • 25k

This should be reasonably straightforward to solve. Can you show the naming pattern of the FASTA sequences and profile files? How do you know which files belong together when running the tool? Did you write the script yourself?

3.2 years ago WouterDeCoster • 39k

The script was not written by me. The FASTA files are fine, and the profile files were generated by the script's creator. The naming pattern is: C_PROKKA_00001 - C_PROKKA_03211 for the genome and pGX1_PROKKA for the plasmids.

3.2 years ago enkh.tug • 0

So if I understood correctly, you have 3211 files named C_PROKKA_00001 up to C_PROKKA_03211. And what about the plasmids? I don't see that you mentioned those before. Is it just one file? Does every C_PROKKA file have to be run together with pGX1_PROKKA?

3.2 years ago WouterDeCoster • 39k

I have 3151 chromosomal genome sequences and 521 plasmid sequences. I started with a multi-FASTA file and split it by header. Each FASTA sequence has a corresponding profile file, and the two have to be run together. In addition, I need to collect the output in one text file, but if I set an output file and then run a second sequence, it overwrites the first result. Thank you for your interest in my problem.

3.2 years ago enkh.tug • 0

I am not sure how you are invoking the script on the command line, but in general the >> redirection operator appends output to an existing file instead of overwriting it.
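
A minimal sketch of the append approach combined with a loop over the files. It assumes the script prints its result to standard output and that the files follow a hypothetical fasta/NAME.fa and profile/NAME.profile pattern; neither is confirmed in this thread. The function name run_all_pairs and its arguments are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per FASTA/profile pair and
# collect all output in a single file.
# Usage: run_all_pairs FASTA_DIR PROFILE_DIR OUT_FILE COMMAND [ARGS...]
run_all_pairs() {
    local fasta_dir=$1 profile_dir=$2 out=$3
    shift 3
    local fasta name profile
    : > "$out"    # start from an empty output file (truncates an old one)
    for fasta in "$fasta_dir"/*.fa; do
        [ -e "$fasta" ] || continue      # the glob matched nothing
        name=$(basename "$fasta" .fa)    # assumed extension: .fa
        profile="$profile_dir/$name.profile"
        if [ ! -f "$profile" ]; then
            echo "no profile for $name, skipping" >&2
            continue
        fi
        # >> appends; a single > here would overwrite on every iteration
        "$@" "$fasta" "$profile" >> "$out"
    done
}
```

It could then be called as, for example, run_all_pairs fasta profile results.txt python my.py, where my.py stands in for the actual prediction script. Pairing by base name rather than by position means a missing profile only skips that one protein.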

3.2 years ago genomax • 68k

Can you post an example of the names of

  • a chromosomal genome sequence and its profile file
  • a plasmid sequence and its profile file

It should be fairly easy to do this, for example see the answer of ole.tange for a gnu-parallel solution.

3.2 years ago WouterDeCoster • 39k

To set clear expectations: if you are doing this on a single computer, a bash loop will still run the jobs serially. Depending on how heavy the computation is in this case (and how many cores you have available on this machine), you could look into using GNU Parallel to speed this process up to some extent.

3.2 years ago genomax • 68k

I assume you can run:

my.py fasta/protein314.fa profile/protein314.profile

I also assume that the fasta and profile directories contain nothing else and that corresponding files are named the same:

parallel my.py ::: fasta/* :::+ profile/*

The :::+ requires a fairly recent version of GNU Parallel. If you only have an older version:

parallel --xapply my.py ::: fasta/* ::: profile/*
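
Both commands pair the two argument lists by position, so a single missing or extra file would shift every later pair. A small pre-flight check can confirm that the two directories line up before anything is run. This is only a sketch: the function name check_pairs is hypothetical, and the .fa and .profile extensions are the assumed ones from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: list base names that occur in only one
# of the two directories. Returns 0 when every FASTA file has a profile
# and every profile has a FASTA file.
check_pairs() {
    local fasta_dir=$1 profile_dir=$2
    local f unmatched
    unmatched=$(
        {
            for f in "$fasta_dir"/*.fa; do
                if [ -e "$f" ]; then basename "$f" .fa; fi
            done
            for f in "$profile_dir"/*.profile; do
                if [ -e "$f" ]; then basename "$f" .profile; fi
            done
        } | sort | uniq -u    # keep names that appear exactly once
    )
    if [ -n "$unmatched" ]; then
        echo "names without a partner file:" >&2
        echo "$unmatched" >&2
        return 1
    fi
    return 0
}
```

If check_pairs fasta profile succeeds, the two globs expand to lists of equal length with matching names, and the positional pairing done by parallel should put each sequence together with its own profile.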
3.2 years ago ole.tange ♦ 3.4k
