How can I run lastz on multiple files using a bash loop and an associative array?
6.3 years ago
SaltedPork ▴ 170

The lastz command will look like:

lastz reference.fasta sample.fasta --ambiguous=IUPAC --format=GENERAL > output.fasta

I need to run lastz once for every sample file, because each sample file has to be compared to its own individual reference (I have one reference for each sample).

I'm using find to get all of my 'sample' files, and will do the same for the reference files.

find . -type f -name 'sample.fasta' -not -path "*/temp/*"

The question is, how do I associate each sample file with each reference file, taking as input the output of the find command? Or is there a better way?

bash • 2.4k views

Based on your find command, it appears that your files are all in one local directory. Can a sample and its corresponding reference be told apart by file extension alone? If so, why not do:

find . -type f -name 'sample.fasta' -not -path "*/temp/*" > files
while IFS= read -r f; do i=${f%.fasta}; lastz "$i.ref" "$i.fasta" --ambiguous=IUPAC --format=GENERAL > "${i}_out.fasta"; done < files
6.3 years ago
mbens ▴ 100

It depends on the location of your reference files. Assuming each reference and its sample are located in the same directory, you won't need associative arrays.

Example file structure:

├── a
│   ├── reference.fasta
│   └── sample.fasta
├── b
│   ├── reference.fasta
│   └── sample.fasta
└── c
    ├── reference.fasta
    └── sample.fasta

Collect paths to samples and references in separate files

find . -type "f" -name "sample.fasta" | sort > samples.list
find . -type "f" -name "reference.fasta" | sort > references.list

Create association sample - reference

paste -d ',' references.list samples.list > file_association.csv
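
The paste step pairs the two lists purely by line position, so it only works if both lists have the same length and sort into the same order. A minimal sketch of a sanity check follows; the function name check_pairs is hypothetical and not part of the original answer. It assumes each reference sits in the same directory as its sample.

```shell
#!/usr/bin/env bash
# check_pairs FILE (hypothetical helper): verify that every
# "reference,sample" row pairs two files from the same directory.
# Mismatched or incomplete rows go to stderr; returns 1 if any were found.
check_pairs() {
    local csv="$1" ref sample bad=0
    while IFS=',' read -r ref sample; do
        if [ -z "$ref" ] || [ -z "$sample" ] ||
           [ "$(dirname "$ref")" != "$(dirname "$sample")" ]; then
            printf 'mismatched pair: %s,%s\n' "$ref" "$sample" >&2
            bad=1
        fi
    done < "$csv"
    return "$bad"
}
```

Running check_pairs file_association.csv before generating the commands should catch a missing reference or sample, since paste would then leave one column empty or shift the pairing.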

Create commands

while IFS="," read -r col1 col2
do
    x=$(dirname "$col2")
    echo "lastz $col1 $col2 --ambiguous=IUPAC --format=GENERAL > $x/output.fasta"
done < file_association.csv

or using GNU parallel

cat file_association.csv | parallel --dry-run --colsep ',' "lastz {1} {2} --ambiguous=IUPAC --format=GENERAL > {1//}/output.fasta"

Resulting commands

lastz ./a/reference.fasta ./a/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./a/output.fasta
lastz ./b/reference.fasta ./b/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./b/output.fasta
lastz ./c/reference.fasta ./c/sample.fasta --ambiguous=IUPAC --format=GENERAL > ./c/output.fasta
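
For completeness, since the question asked about associative arrays: a rough sketch of that approach, keyed by directory, could look like the following. It assumes bash 4 or newer (associative arrays are not available in bash 3.2), a find and sort that support -print0 and -z, and the same file names as above. The function name build_commands is hypothetical. Like the loop above, it only prints the commands.

```shell
#!/usr/bin/env bash
# build_commands ROOT (hypothetical helper): print one lastz command per
# directory under ROOT that contains both reference.fasta and sample.fasta.
build_commands() {
    local root=${1:-.} path dir
    local -A ref_for=()   # maps directory -> path of its reference file

    # First pass: index every reference file by the directory it lives in.
    while IFS= read -r -d '' path; do
        ref_for["$(dirname "$path")"]=$path
    done < <(find "$root" -type f -name 'reference.fasta' -not -path '*/temp/*' -print0)

    # Second pass: look up the reference stored for each sample's directory.
    while IFS= read -r -d '' path; do
        dir=$(dirname "$path")
        if [[ -n ${ref_for[$dir]:-} ]]; then
            printf 'lastz %q %q --ambiguous=IUPAC --format=GENERAL > %q\n' \
                "${ref_for[$dir]}" "$path" "$dir/output.fasta"
        else
            printf 'no reference found for %s\n' "$path" >&2
        fi
    done < <(find "$root" -type f -name 'sample.fasta' -not -path '*/temp/*' -print0 | sort -z)
}
```

Calling build_commands . prints the commands for review, and piping that output to bash would execute them. Unlike the paste approach, a sample without a reference is reported rather than silently shifting the pairing.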

Wow, this is precisely what I want, and yes, moving the files to the same directory is a lot easier. Thank you so much!

6.3 years ago
$ tree .
.
├── a
│   ├── reference.fa
│   └── sample.fa
├── b
│   ├── reference.fa
│   └── sample.fa
└── c
    ├── reference.fa
    └── sample.fa

3 directories, 6 files

Using GNU parallel (remove the --dry-run option to actually execute the commands):

$ ls -d */ | parallel --dry-run 'lastz ./{}reference.fa ./{}sample.fa --ambiguous=IUPAC --format=GENERAL > ./{}output.fa'

Commands that would be executed (i.e., the dry-run output):

lastz ./a/reference.fa ./a/sample.fa --ambiguous=IUPAC --format=GENERAL > ./a/output.fa
lastz ./b/reference.fa ./b/sample.fa --ambiguous=IUPAC --format=GENERAL > ./b/output.fa
lastz ./c/reference.fa ./c/sample.fa --ambiguous=IUPAC --format=GENERAL > ./c/output.fa
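
If GNU parallel is not installed, a plain loop over the subdirectories is a possible serial alternative. This is only a sketch under the same layout assumption (reference.fa and sample.fa in each subdirectory); the function name run_lastz_per_dir and its yes/no dry-run argument are hypothetical.

```shell
#!/usr/bin/env bash
# run_lastz_per_dir [yes|no] (hypothetical helper): loop over the
# subdirectories of the current directory. With "yes" (the default) the
# commands are only printed; with "no" lastz is actually run.
run_lastz_per_dir() {
    local dry_run="${1:-yes}" dir
    for dir in */; do
        # Skip directories that lack either input file.
        [ -f "${dir}reference.fa" ] && [ -f "${dir}sample.fa" ] || continue
        if [ "$dry_run" = yes ]; then
            echo "lastz ./${dir}reference.fa ./${dir}sample.fa --ambiguous=IUPAC --format=GENERAL > ./${dir}output.fa"
        else
            lastz "./${dir}reference.fa" "./${dir}sample.fa" \
                --ambiguous=IUPAC --format=GENERAL > "./${dir}output.fa"
        fi
    done
}
```

Calling run_lastz_per_dir yes mirrors the dry run above, and run_lastz_per_dir no performs the alignments one directory at a time.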