python script to sort blast results and fetch best matches from NCBI
1
0
Entering edit mode
6.3 years ago
skbrimer ▴ 740

Hello hive mind,

I have been trying to automate a process that I do for flu sequencing. I currently use SPAdes to do a de novo build and then blast the contigs using the ncbi command line, remotely. I get back a tab delimited csv file that has the contig, the sequence id, the ncbi acc # and the raw score.

I have working a small script to pull just flu contigs

import re
import csv
import sys

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

this works just fine, however now I want to take the flu file and select the best match for each contig so I can make a list of the accession numbers to later then use to pull from ncbi to make a reference.

My current issue is trying to pull the highest score. I have some code (below) that is working but not the way I want. It is only returning the highest score from all the contigs not the highest score for each contig.

with open(sys.argv[1], newline="") as csv_file, open("temp.csv","w") as subset:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(subset, delimiter="\t")
    for row in reader:
        if row[0] in NodeList:
            writer.writerow(row)
    with open("temp.csv", "r") as temp:
        subset_reader = csv.reader(temp, delimiter="\t")
        BestHit = max(subset_reader, key=lambda column: int(column[-1].replace(',','')))
        print(BestHit)

I do this manually now but it take a lot longer than it should, so I would really appreciate any direction with this.

Thank you in advance, Sean

Assembly alignment sequence python • 2.3k views
ADD COMMENT
1
Entering edit mode

Blast parser in biopython?

ADD REPLY
0
Entering edit mode

see, this is why I come here. I may have been doing this the hard way

ADD REPLY
2
Entering edit mode
6.3 years ago
skbrimer ▴ 740

here is my solution!

import re
import csv
import sys
import pandas as pd

with open(sys.argv[1], newline="") as csv_file, open("flu_hits.csv","w") as justFlu:
    reader = csv.reader(csv_file, delimiter="\t")
    writer = csv.writer(justFlu, delimiter="\t")
    for row in reader:
        if re.match(r'influenza',row[1], re.I) != None:
           writer.writerow(row)

flu_matches = pd.read_csv("flu_hits.csv", sep="\t", header=None)

NodeList = {node for node in list(flu_matches[0])}

for node in NodeList:
    contig = flu_matches[0].str.contains(node)
    hits = list(flu_matches[contig].max())
    print(hits[-2])

I need to change the print function to a file to write to for further processing but it is doing want I need!

ADD COMMENT

Login before adding your answer.

Traffic: 2179 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6