Question: Calculate percentage of local blast hit
0
Entering edit mode
3 months ago
fec2 • 20

Hi all,

I need to calculate percentage of blastp hits from a few genome as formula below:

[(C1 + C2)/(T1 + T2)] * 100

C1 and C2 is the number of blastp hits of two genomes against each other, for example genome A against genome B represent C1 and genome B against genome A represent C2. T1 and T2 is the total number of proteins in the two genomes being compared.

Now, I have a .txt file consists total number of protein for all my genome of interest:

A:1234 B:1234 C:1234...

I also have another .txt file consists of blastp result for each other genome:

A_B 123, B_A 123, A_C 123, B_C 123..

Is it possible to have a command that is able to calculate the percentage of A_B, A_C and B_C based on formula above?

I apologise if my question is confusing and please tell me if any part of my question is unclear.

ADD COMMENTlink 3 months ago fec2 • 20 • updated 3 months ago SMK ♦ 1.3k
Entering edit mode
0

a single (linux) cmdline will be difficult I guess. Are you familiar with any programming/scripting language? Writing a small script will have that processed quickly.

ADD REPLYlink 3 months ago
lieven.sterck
5.1k
Entering edit mode
0

Hi, I am not familiar with any programming/scripting language. I am actually kind of new in this field.

ADD REPLYlink 3 months ago
fec2
• 20
3
Entering edit mode
3 months ago
SMK ♦ 1.3k
Ghent, Belgium

Hi fec2,

This should work:

$ cat prot_numbers.txt
A:1234
B:1234
C:1234

$ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126

$ awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers.txt blastp.txt
B-C 10.0891
A-B 10.0081
A-C 10.0486

-F'[:_ ]' uses :, _, and (space) as delimiters. It first creates an array named prot to save the number of protein in each genome then adds the hits from A_B and B_A to an array named num with A-B as the key (after sorting the order of each pair: A_B -> A_B, B_A -> A_B). Finally, for each pair, compute (C1 + C2) num[pair] *100/(T1 + T2) deno[pair]. Note that in the results the pairs are unique already.

ADD COMMENTlink 3 months ago SMK ♦ 1.3k
Entering edit mode
0

Hi SMK,

Thanks for your script. May I know what is the command "cat prot_numbers.txt" and "cat blastp.txt" for?

I got error as below: awk: calling undefined function asort input record number 1, file blastp.txt source line number 1

My actual data are like below:

prot_numbers.txt:

 Chryseomicrobium_excrementi_ET03.fasta:2640
 Filibacter_tadaridae_TB_66.fasta:2970
 Jeotgalibacillus_alimentarius_YKJ_13.fasta:3372

blastp.txt:

 Chryseomicrobium_excrementi_ET03.fasta_Filibacter_tadaridae_TB_66.fasta.tsv        1526
 Chryseomicrobium_excrementi_ET03.fasta_Jeotgalibacillus_alimentarius_YKJ_13.fasta.tsv      1556
 Filibacter_tadaridae_TB_66.fasta_Chryseomicrobium_excrementi_ET03.fasta.tsv        1514
 Filibacter_tadaridae_TB_66.fasta_Jeotgalibacillus_alimentarius_YKJ_13.fasta.tsv        1576
 Jeotgalibacillus_alimentarius_YKJ_13.fasta_Chryseomicrobium_excrementi_ET03.fasta.tsv      1548
 Jeotgalibacillus_alimentarius_YKJ_13.fasta_Filibacter_tadaridae_TB_66.fasta.tsv        1577

Thank you very much.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

Hi fec2,

cat prot_numbers.txt and cat blastp.txt were just to show you how the example dataset I used looked like. Is your "actual" blastp.txt separated with a tab between the first and the second column?

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

I got the result using command from previous post (C: To get number of hits from blastp output file) as below:

(for i in *.tsv; do echo -ne "${i}\t" && awk '$3>=40{print $2}' ${i} | sort -u | wc -l; done) > blastp.txt

in which I have multiple blastp output (format 6) in a directory, and I want to calculate the number of hits with sequence identity of more than 40% for each output file.

I am not sure is it separate by tab. I am so sorry for that.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

Hi fec2,

From echo -ne "${i}\t" we know that the results will be tab (\t) separated, and we can use man echo to check out the meaning of parameters: -ne. Or, you can use cat -T blastp.txt to check, \t will be revealed as ^I (I really suggest you make it a chance to learn and to know what each component in the commands is for, instead of just copy/paste).

In your original question, it was: A_B 123, while in your "actual" dataset it is something like A_B.tsv 123. I would suggest to first re-format your "actual" data to keep only prefix (species?) and use tab to separate them (because _ is also in your prefix which makes things complex a bit). Finally, you can change _ from the delimiters to \t for awk. This should get you there:

sed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt
sed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt

awk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt
ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Hi SMK,

Thanks for your advice, I will learn it for the future use. However, I still got error after reformat the .txt file :

awk: calling undefined function asort
input record number 1, file blastp_re.txt
source line number 1

Any suggestion? Really thanks for your help.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
1

Hi fec2,

Are you using the default awk in OSX? The commands (sed, awk) are based on the GNU Project's implementation, which should work in Linux (GNU) environment.

If you're in OSX, you can first do: brew install gnu-sed gawk (check out homebrew), and change the codes to:

gsed "s/.fasta_/\t/g; s/.fasta.tsv//g" blastp.txt > blastp_re.txt
gsed "s/.fasta//g" prot_numbers.txt > prot_numbers_re.txt

gawk -F'[:\t]' 'FILENAME=="prot_numbers_re.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (pair in num){print pair, num[pair]*100/deno[pair]}}' prot_numbers_re.txt blastp_re.txt

Check here for more information about AWK implementations.

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Thank you very much! The command is working fine now.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

That's great, glad it helps! :-)

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Hi,

I wonder is it possible to have the result in matrix format? Thank you.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

Can you provide an example?

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

For example:

              A      B       C    
       A     100  
       B     95     100     74      
       C     58     78      100
ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

If still sticking to what was asked at the beginning:

Is it possible to have a command...

You can try changing the code to:

$ cat blastp.txt
A_B 123
B_A 124
A_C 123
C_A 125
B_C 123
C_B 126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234
$ awk -F'[:_ ]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt
.   10.0081 10.0486
.   .   10.0891
.   .   .

Because each pair is sorted alphabetically (A-B: row-col), here it only shows A-B, A-C, and B-C.

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Hi,

I got an error as below:

    cmd. line:1: (FILENAME=blastp_re.txt FNR=6) fatal: division by zero attempted
ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

What was the input?

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

I m sorry, I missed out something important, the input blastp.txt should be:

   A          B            123
   B         C             321

And is separate by tab. I am sorry that I missed out this part

Is not A_B and B_C.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

I see. As I explained before:

-F'[:_ ]' uses ":", "_", and " " (space) as delimiters

If the inputs are:

$ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234

Changing the command to this should work:

awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (i in prot) {for (j in prot) {pair=i"-"j; if (pair in num) {printf num[pair]*100/deno[pair] "\t"} else {printf ".\t"}}; printf "\n"}}' prot_numbers.txt blastp.txt
ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Hi,

Thank you. However, the matrix doesn't have any label. Only has number and “.”. Could it be possible to have a label? Thanks again.

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
1

Alright, this should give you the labels (glad that it's still "a" command):

awk -F'[:\t]' 'FILENAME=="prot_numbers.txt" {prot[$1]=$2; next} {a[1]=$1; a[2]=$2; asort(a); num[a[1]"-"a[2]]+=$3; deno[a[1]"-"a[2]]=prot[a[1]]+prot[a[2]]} END {for (col in prot) {printf "\t" col}; printf "\n"; for (row in prot) {printf row; for (col in prot) {pair=row"-"col; if (pair in num) {printf "\t" num[pair]*100/deno[pair]} else {printf "\t."}}; printf "\n"}}' prot_numbers.txt blastp.txt

When inputs are:

$ cat blastp.txt
A   B   123
B   A   124
A   C   123
C   A   125
B   C   123
C   B   126
$ cat prot_numbers.txt
A:1234
B:1234
C:1234

It should return...

    A   B   C
A   .   10.0081 10.0486
B   .   .   10.0891
C   .   .   .
ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Thank you very much!

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

Could it be possible that the dataset has no self comparison? For example no A-A, B-B and C-C?

ADD REPLYlink 3 months ago
fec2
• 20
Entering edit mode
0

When no pair being found it prints “.”.

ADD REPLYlink 3 months ago
SMK
♦ 1.3k

Login before adding your answer.

Powered by the version 1.5