Biostar Beta. Not for public use.
Calculate percentage of local blast hit
0
Entering edit mode
16 months ago
fec2 • 20

Hi all,

I need to calculate percentage of blastp hits from a few genome as formula below:

[(C1 + C2)/(T1 + T2)] * 100

C1 and C2 is the number of blastp hits of two genomes against each other, for example genome A against genome B represent C1 and genome B against genome A represent C2. T1 and T2 is the total number of proteins in the two genomes being compared.

Now, I have a .txt file consists total number of protein for all my genome of interest:

A:1234 B:1234 C:1234...

I also have another .txt file consists of blastp result for each other genome:

A_B 123, B_A 123, A_C 123, B_C 123..

Is it possible to have a command that is able to calculate the percentage of A_B, A_C and B_C based on formula above?

I apologise if my question is confusing and please tell me if any part of my question is unclear.

gene genome • 366 views
0
Entering edit mode

a single (linux) cmdline will be difficult I guess. Are you familiar with any programming/scripting language? Writing a small script will have that processed quickly.

0
Entering edit mode

Hi, I am not familiar with any programming/scripting language. I am actually kind of new in this field.

3
Entering edit mode
16 months ago
SMK ♦ 1.3k
Ghent, Belgium

Hi fec2,

This should work:

$cat prot_numbers.txt A:1234 B:1234 C:1234$ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126

$awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers.txt blastp.txt
B-C 10.0891
A-B 10.0081
A-C 10.0486


-F'[:_ ]' uses :, _, and (space) as delimiters. It first creates an array named prot to save the number of protein in each genome then adds the hits from A_B and B_A to an array named num with A-B as the key (after sorting the order of each pair: A_B -> A_B, B_A -> A_B). Finally, for each pair, compute (C1 + C2) num[pair] *100/(T1 + T2) deno[pair]. Note that in the results the pairs are unique already.

0
Entering edit mode

Hi SMK,

Thanks for your script. May I know what is the command "cat prot_numbers.txt" and "cat blastp.txt" for?

I got error as below: awk: calling undefined function asort input record number 1, file blastp.txt source line number 1

My actual data are like below:

prot_numbers.txt:

 Chryseomicrobium_excrementi_ET03.fasta:2640
Jeotgalibacillus_alimentarius_YKJ_13.fasta:3372


blastp.txt:

 Chryseomicrobium_excrementi_ET03.fasta_Filibacter_tadaridae_TB_66.fasta.tsv        1526
Chryseomicrobium_excrementi_ET03.fasta_Jeotgalibacillus_alimentarius_YKJ_13.fasta.tsv      1556
Jeotgalibacillus_alimentarius_YKJ_13.fasta_Chryseomicrobium_excrementi_ET03.fasta.tsv      1548


Thank you very much.

0
Entering edit mode

Hi fec2,

cat prot_numbers.txt and cat blastp.txt were just to show you how the example dataset I used looked like. Is your "actual" blastp.txt separated with a tab between the first and the second column?

0
Entering edit mode

I got the result using command from previous post (C: To get number of hits from blastp output file) as below:

(for i in *.tsv; do echo -ne "${i}\t" && awk '$3>=40{print $2}'${i} | sort -u | wc -l; done) > blastp.txt

in which I have multiple blastp output (format 6) in a directory, and I want to calculate the number of hits with sequence identity of more than 40% for each output file.

I am not sure is it separate by tab. I am so sorry for that.

0
Entering edit mode

Hi fec2,

From echo -ne "${i}\t" we know that the results will be tab (\t) separated, and we can use man echo to check out the meaning of parameters: -ne. Or, you can use cat -T blastp.txt to check, \t will be revealed as ^I (I really suggest you make it a chance to learn and to know what each component in the commands is for, instead of just copy/paste). In your original question, it was: A_B 123, while in your "actual" dataset it is something like A_B.tsv 123. I would suggest to first re-format your "actual" data to keep only prefix (species?) and use tab to separate them (because _ is also in your prefix which makes things complex a bit). Finally, you can change _ from the delimiters to \t for awk. This should get you there: sed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt sed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt awk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt

0
Entering edit mode

Hi SMK,

Thanks for your advice, I will learn it for the future use. However, I still got error after reformat the .txt file :

awk: calling undefined function asort
input record number 1, file blastp_re.txt
source line number 1


Any suggestion? Really thanks for your help.

1
Entering edit mode

Hi fec2,

Are you using the default awk in OSX? The commands (sed, awk) are based on the GNU Project's implementation, which should work in Linux (GNU) environment.

If you're in OSX, you can first do: brew install gnu-sed gawk (check out homebrew), and change the codes to:

gsed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt
gsed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt

gawk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt  Check here for more information about AWK implementations. ADD REPLYlink 0 Entering edit mode Thank you very much! The command is working fine now. ADD REPLYlink 0 Entering edit mode That's great, glad it helps! :-) ADD REPLYlink 0 Entering edit mode Hi, I wonder is it possible to have the result in matrix format? Thank you. ADD REPLYlink 0 Entering edit mode Can you provide an example? ADD REPLYlink 0 Entering edit mode For example:  A B C A 100 B 95 100 74 C 58 78 100  ADD REPLYlink 0 Entering edit mode If still sticking to what was asked at the beginning: Is it possible to have a command... You can try changing the code to: $ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126
$cat prot_numbers.txt A:1234 B:1234 C:1234$ awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt . 10.0081 10.0486 . . 10.0891 . . .  Because each pair is sorted alphabetically (A-B: row-col), here it only shows A-B, A-C, and B-C. ADD REPLYlink 0 Entering edit mode Hi, I got an error as below:  cmd. line:1: (FILENAME=blastp_re.txt FNR=6) fatal: division by zero attempted  ADD REPLYlink 0 Entering edit mode What was the input? ADD REPLYlink 0 Entering edit mode I m sorry, I missed out something important, the input blastp.txt should be:  A B 123 B C 321  And is separate by tab. I am sorry that I missed out this part Is not A_B and B_C. ADD REPLYlink 0 Entering edit mode I see. As I explained before: -F'[:_ ]' uses ":", "_", and " " (space) as delimiters If the inputs are: $ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
$cat prot_numbers.txt A:1234 B:1234 C:1234  Changing the command to this should work: awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt

0
Entering edit mode

Hi,

Thank you. However, the matrix doesn't have any label. Only has number and “.”. Could it be possible to have a label? Thanks again.

1
Entering edit mode

Alright, this should give you the labels (glad that it's still "a" command):

awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (col in prot) {printf "\t" col}; printf "\n"; for (row in prot) {printf row; for (col in prot) {pair=row"-"col; if (pair in num) {printf "\t" num[pair]*100/deno[pair]} else {printf "\t."}}; printf "\n"}}' prot_numbers.txt blastp.txt  When inputs are: $ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
\$ cat prot_numbers.txt
A:1234
B:1234
C:1234


It should return...

    A   B   C
A   .   10.0081 10.0486
B   .   .   10.0891
C   .   .   .

0
Entering edit mode

Thank you very much!

0
Entering edit mode

Could it be possible that the dataset has no self comparison? For example no A-A, B-B and C-C?