Question

Finding fasta files that share the highest number of identical headers

0

Entering edit mode

6.3 years ago

jan • 0

Hello,

I have about 100 fasta files and need to find those 10 that share the highest number of identical headers. I.e., I am looking for the 10 fasta files for which the number of species that are present in all of them is the highest. Any idea how to do that? A script (perl, python, bash) would be great, but software recommendations are also welcome.

Thank you and happy holidays!

sequence script perl python fasta • 1.6k views

ADD COMMENT • link updated 6.3 years ago by 5heikki 11k • written 6.3 years ago by jan • 0

0

Entering edit mode

6.3 years ago

5heikki 11k

This is definitely not the most efficient way..

#!/bin/bash
FILES=($(find . -maxdepth 1 -name "*.fa"))
for ((i=0; i<${#FILES[@]}; i++)); do
    for ((j=0; j<${#FILES[@]}; j++)); do
        echo -e "${FILES[$i]}\t${FILES[$j]}"
        comm -1 -2 \
            <(grep "^>" ${FILES[$i]} | sort) \
            <(grep "^>" ${FILES[$j]} | sort) \
            | wc -l
    done
done

ADD COMMENT • link 6.3 years ago by 5heikki 11k

0

Entering edit mode

I am getting a long list showing the number of headers, which are shared among two files each. Getting the names of the 10 files with the most shared headers remains unresolved. Anyway, thanks for trying!

ADD REPLY • link 6.3 years ago by jan • 0

0

Entering edit mode

#!/bin/bash
FILES=($(find . -maxdepth 1 -name "*.fa"))
for ((i=0; i<${#FILES[@]}; i++)); do
    for ((j=0; j<${#FILES[@]}; j++)); do
        SCOUNT=$(comm -1 -2 <(grep "^>" ${FILES[$i]} | sort) <(grep "^>" ${FILES[$j]} | sort) | wc -l)
        echo -e "${FILES[$i]}\t${FILES[$j]}\t$SCOUNT"
    done
done | sort -t $'\t' -k3,3g

The inner for loop stuff can be inside

if [[ ${FILES[$i]} != ${FILES[$j]} ]]; then
  ..
fi

And then you probably also want to filter out duplicates (file1 vs file2 / file2 vs file1) and finally pipe to tail -n10 to get just the 10 but I leave these to you

ADD REPLY • link 6.3 years ago by 5heikki 11k

score 4 · Accepted Answer · 2017-12-21

4

Entering edit mode

6.3 years ago

Pierre Lindenbaum 161k

cat  *.fasta | grep "^>" | sort | uniq -c | sort -n | tail

EDIT

(...) That's true. It gives the headers not the file names.

cat *.fasta | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fasta; done | sort | uniq -c | sort -n

ADD COMMENT • link 6.3 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Assuming that post is about identifying files only, with same fasta headers, minor changes to code:

$ cat *.fasta | grep "^>" | sort | uniq  -d | while read H; do grep -Fw "${H}" -l *.fasta; done | sort | uniq -c | sort -n

output from OP code (edited .fasta to .fa in OP code):

$ cat *.fa | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fa; done | sort | uniq -c | sort -n
      1 test1.fa
      1 test2.fa
      1 test3.fa
      1 test4.fa
      1 test6.fa
      2 test5.fa
      2 test7.fa

post minor modification:

$ cat *.fa | grep "^>" | sort | uniq -d | while read H; do grep -wF "${H}" -l *.fa; done | sort | uniq -c | sort -n
      1 test1.fa
      1 test2.fa
      1 test3.fa
      1 test4.fa
      1 test6.fa

To print files with same headers (with header, count and file names):

$ grep -h "^>" *.fa | sort | uniq -d | grep  -wf - *.fa | sort -k2 -t  ":" | datamash -t : -g2 count 2 collapse 1  | sed 's/:/\t/g'
>seq1   3   test1.fa,test2.fa,test6.fa
>seq2   2   test3.fa,test4.fa

test input files with .fa extension:

  $ cat test1.fa 
    >seq1
    atgc
    $ cat test2.fa 
    >seq1
    atgc
    $ cat test3.fa 
    >seq2
    gcta
    $ cat test4.fa 
    >seq2
    gcta
    $ cat test5.fa 
    >seq11
    atgc
    $ cat test6.fa 
    >seq1
    atgc
    $ cat test7.fa 
    >seq12
    cgtatata

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Entering edit mode

This is a great advice. Thank you very much! I had to replace "grep -wf - *.fa" by "grep -wf *.fa". I assume the dash was added accidentally, is that right? With the dash it wasn't working. However, now it's working but for all OTUs (headers), the number of files in which they are represented is reduced by 1 (the first occurrence is missing). Also, I don't understand the reason for the modifications of Pierre's command but it seems that with exception of this command (cat *.fa | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fa; done | sort | uniq -c | sort -n) some numbers specifying the headers that occur in all files are slightly different now.

ADD REPLY • link 6.3 years ago by jan • 0

0

Entering edit mode

one-liner always rocks!

ADD REPLY • link 6.3 years ago by Nicolas Rosewick 10k

0

Entering edit mode

Hi Pierre! Thanks for the fast reply. This command, however, would give me the 10 fasta files that have the highest number of redundant headers within the same file. There are no redundant headers in each of my files. What I need is to find the 10 fasta files with the highest number of identical/redundant headers among all the fasta files.

ADD REPLY • link 6.3 years ago by jan • 0

0

Entering edit mode

my bad, I've fixed with cat

ADD REPLY • link 6.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

@Pierre : Your command seems to output the top 10 most redundant headers not the top 10 fasta file names non ?

ADD REPLY • link 6.3 years ago by Nicolas Rosewick 10k

0

Entering edit mode

@Pierre and Nicolas: That's true. It gives the headers not the file names.

ADD REPLY • link 6.3 years ago by jan • 0

0

Entering edit mode

Great, the edited version is doing the job :) Thank you, Pierre!

ADD REPLY • link 6.3 years ago by jan • 0