Biostar Beta. Not for public use.
common sequence between multi fasta file
0
Entering edit mode
17 months ago
Sam • 110

Dear All

how I can retrieve common seqs between multiple fasta (12 ) files?

Thanks

bash • 732 views
1
Entering edit mode

I already obtain common seqs with this command , is it a correct way ?

comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(sort A1_novel_mirnas.fa) <(sort A2_novel_mirnas.fa)) <(sort B1_novel_mirnas.fa)) <(sort B2_novel_mirnas.fa)) <( sort C1_novel_mirnas.fa)) <(sort C2_novel_mirnas.fa)) <(sort D1_novel_mirnas.fa)) <(sort D2_novel_mirnas.fa)) <(sort E1_novel_mirnas.fa)) <(sort E2_novel_mirnas.fa)) <(sort F1_novel_mirnas.fa)) <(sort F2_novel_mirnas.fa)

0
Entering edit mode

So you have RNA sequence. Can you try replacing U with T and running @5heikki's script again?

0
Entering edit mode

Hello Sam,

what do you mean by "common seqs". Please describe more clearly what you have and what your goal is.

fin swimmer

0
Entering edit mode

Hi

Common mean sequence which is available in all fasta file. I want to find common sequences between all 12 fasta files.

0
Entering edit mode

Are there many sequences in each file or just one per file?

0
Entering edit mode

1
Entering edit mode

Doing quick blat searches is one option that should be pretty fast. Other option may be CD-HIT to do pair-wise comparisons.

0
Entering edit mode

The unix commandline utility comm will be useful here, so long as you can sort your files first. (Assuming ‘common’ in this question means the sequences or headers are identical)

0
Entering edit mode
1. make a list of the unique fasta sequences

cat *.mfa | grep -v ">" | sort -u > unique_sequences

2. loop each line in all .mfa file

cat unique_sequences | while read line; do grep -w $line *.mfa; done 3. you could easily go further and check which sequence is present 12 times ADD REPLYlink 1 Entering edit mode 18 months ago 5heikki 8.4k Finland This assumes that 1) blast is installed and that 2) the script is run in the dir where .fasta named files reside #!/bin/bash FILES=($(find . -maxdepth 1 -type f -name "*.fasta"))
for((i=0;i<${#FILES[@]}-1;i++)); do blastn -query${FILES[$i]} -subject${FILES[$i+1]} -outfmt '6 qseqid qlen length slen qseq' \ | awk 'BEGIN{OFS=FS="\t"}{if($2==$3 &&$2==$4){print ">"$1"\n"$5}}' > tmp.fasta cp tmp.fasta${FILES[\$i+1]}
done


So we blast first file against second file and put into tmp.fasta only the identical sequences between the files. Then we replace the second file with tmp.fasta. Then we blast that against the third file. Etc. Files will be deleted. If there are common sequences, in the end they will be in the tmp.fasta file. Here we assume that if query length = alignment length = subject length the sequences are the same. Of course, this might not be the case. I leave it for you how to make it more precise blastn -help

0
Entering edit mode

Did you mean to use blast2seq? I don't see blast indexes being made anywhere. Edit: That is the correct syntax for blast2seq (not having used it on the command line made me check again). @sam: If you are getting an error then post the exact error or explain what "is not working".

0
Entering edit mode

unfortunately it's not working and as Genomax mentioned for blastn we need index , which is missing

0
Entering edit mode

If your file names don't end in .fasta then you need to change the code above to account for that.

0
Entering edit mode

Warning: [blastn] Query is Empty! and it convert all fasta file to empty file!

0
Entering edit mode

I just tested @5heikki's script and it worked flawlessly.

Do all your files have consistent names/format?

0
Entering edit mode

If the sequences are really short add -task blastn-short to the command. Also, is it possible that you don't have any "universal" seqs?

0
Entering edit mode

the length of seqs are about 20 bp , how ever I tested with blastn-short and it produced a huge temp file over than 600 Mb from just 600 seqs (total amount of seqs from all files is about 600 ) . when I lunched it with just 4 lib it was fine but for all 12 libs it's not work correctly

Check from blastn -help output how to add a field to outfmt for sequence similarity and filter also based on that in the awk command. In case the "temp" file is still huge, you must have a lot of redundancy in your input.