Biostar Beta. Not for public use.
Filtering BLAST results
0
Entering edit mode
14 months ago
flogin • 110
FioCruz/Brazil

I have a csv file like this:

subject,query

VBAnDirWR_KB672490,INVdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,VIRdsDNA_NC_029032

VBAnDirWR_KB672490,BACdsDNA_NC_029032

VBAnDirWR_KB672491,VIRdsDNA_NC_029032

VBAnDirWR_KB672491,VIRdsDNA_NC_029032

VBAnDirWR_KB672491,BACdsDNA_NC_029032

VBAnDirWR_KB672491,BACdsDNA_NC_029032

VBAnDirWR_KB672491,INVdsDNA_NC_029032

VBAnDirWR_KB672492,BACdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

VBAnDirWR_KB672492,VIRdsDNA_NC_029032

As we can see, are 5 queries per subject (results of -num_alignments 5, in BLAST), I need recovery only subjects IDs that have the 'VIR' tag in 3 or more queries, so I wrote (or I tried wrote xD), a python script to do this:

import pandas as pd
list_subjects = []
vir = 0
count = 0

# loading archive...
blast = pd.read_csv('blast_teste.csv')
# converting in a data frame...
blast_df = pd.DataFrame(blast)
# separing columns in subject and query
subject = blast_df.iloc[:,0]
query = blast_df.iloc[:,1]
# reading subject and query at the same time...
for i,c in zip(subject,query): 
    if count < 5: # to count viral tag each five elements.
        if vir < 3 and 'VIR' in c: # if tag in query
            vir += 1 
            count += 1
        elif vir >= 3: # if each 5 queries, 3 are viral, retrieve the subject ID
            list_subjects.append(i)
            count += 1
        else:
            count += 1
    elif count == 5: # when count reach in five, the count is reseted to start a new step of analysis
        count == 0
        vir == 0
print(list_subjects)

but this code returns: ['VBAnDirWR_KB672490'] and should be ['VBAnDirWR_KB672490','VBAnDirWR_KB672492'] (because those 2 subjects have more than 3 queries with 'VIR' tag.

can anybody help me?

ADD COMMENTlink
0
Entering edit mode

Thanks guys, both suggestions are working!

Best,

ADD REPLYlink
4
Entering edit mode
15 months ago
h.mon 25k
Brazil

In Perl, using the same logic suggested by Carambakaracho .

#!/usr/bin/perl

use strict;
use warnings;

my %filter = ();
while (<>){
    my( $subject, $query ) = split ",";
    if( $query =~ /^VIR/ ){ $filter{$subject}++; }
}

foreach my $key ( keys %filter ){
    if( $filter{$key} > 2 ){ print "$key\n" };
}
ADD COMMENTlink
3
Entering edit mode
14 months ago
Carambakaracho ♦ 1.2k
Switzerland/Basel

This looks much like some sort of constructed problem, does it? (Not to say "Homework") In any way, you're half way there - You need to reset the count variable, though it's quite useless. Simpler convert list_subjects to a dictionary, subject as key, 'VIR' count as value, increment each 'VIR' count per subject, return subjects with count >= 3

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3