Question

BioPython: how to edit a fasta file to include a large deletion

2

Entering edit mode

8.1 years ago

skbrimer ▴ 740

Morning Gang,

I feel like I am just missing something in the documentation for this but here goes, I'm currently running some salmonellas through the works using a very well annotated reference and we have come across a large drop out of genes in one location. We went back and confirmed they are not there with a quick pcr so I know the data is correct, however when I go to make the consensus chromosome I can not figure out how to include the drop out in the file.

For the mapping I'm using TMAP (ion torrent data, single end reads), freebayes for the haploid vcf creation, and SnpEff for the SNP annotation.

I know if I want to create a new chromosome of our sample I can use the vcf file to make the SNP changes in the reference file, however that will retain the genes that have dropped out because I'm making the changes to the reference not the alignment.

Does anyone have a suggestion or a work around for this type of problem?

Thank you in advance, Sean

EDIT 15MAR2016

So I have been working on this for some time now and I'm close to my solution but I'm having an issue the "new" sequence. Below is my code.

    ## this makes a dict of the samtools depth coverage input file of all the 0's
    coverageDict = {}

    ## this loops over the input depth information and appends the dictionary with 
    ## a key,value as genome position, depth of coverage
    coverage = open("/home/sbrimer/Desktop/Columbia ST/1819_1/coverage.txt","r")
    for line in coverage:
        coverages = line.split("\t")
        coverageDict[coverages[1]] = coverages[2]
    coverage.close()

   ## this makes a set of the large dict of just the missing.   
    missing = {index for index,value in coverageDict.items() if value == 0}

    def filter_coverage(index,value):
        return 'N' if index in missing else value

    ref = open("Salmonella enterica subsp.fasta","rU")
    ## this should read in the sequence file as an object, change it to a mutable
    ## object, then anywhere the coverageDict value = 0. Replace that nt with "N"
    for seq_record in SeqIO.parse(ref,"fasta"):
        refSeq = seq_record.seq
        new_seq = "".join(filter_coverage(index,value) for index,value in enumerate(refSeq.tomutable(), start=1))
        new_rec = SeqRecord(Seq(new_seq),id="new_id",description="new_description")
        SeqIO.write(new_rec,"NewSeq.fasta","fasta")

    ref.close()

Everything seems to work fine until I want to add the 'N' to the fasta file, I can get the generic "new_id" and "new_description" to show up in the test file, but not the newseq. I have also tried the following:

first this

refSeq = seq_reocrd.seq
refSeq_mut = refSeq.tomutable()

then

refSeq = str(seq_record.seq)

but for whatever reason I can not seem to change the seq_record.seq part of the record. Has anyone else had this issue?

Assembly next-gen alignment sequence biopython • 4.5k views

ADD COMMENT • link updated 5.5 years ago by Leonor Palmeira 3.8k • written 8.1 years ago by skbrimer ▴ 740

1

Entering edit mode

You could replace the dropped region with a stretch of N's to keep the length consistent.

ADD REPLY • link 8.1 years ago by GenoMax 141k

0

Entering edit mode

I thought about that as well. I know that in samtools I can use mpileup to make a consensus that will use the "N" as the placeholder and I thought about remapping to that and use the existing vcf file to annotate. I guess my biggest issue is that I can not locate the coordinates of the dropout without using a genome viewer.

ADD REPLY • link 8.1 years ago by skbrimer ▴ 740

0

Entering edit mode

Good edit! Before you didn't mention Biopython if I remember correctly. Please see my hints to improve the post below (so please, no offence):

People (also me) might not understand 'drop out', did you mean 'deletion', or is it an IonTorrent specific term?
Be short and specific, and to the point, especially in the title, phrase the title like a question. The more the question is stripped to the core, the more people will possibly look at it. (Some people not working with bacteria might be able to help as well)
'Need help' is not necessary.
Is it relevant that it is a bacterial genome or is it only relevant that it is a fasta file?
I would just frame it 'BioPython: how to insert a stretch of N's into a fasta record?' or 'BioPython: how to edit a fasta file to include a large deletion'.
The XY problem: is it really a good solution to add N's (maybe it was a very good idea by the commenter)? N's signify there is some sequence there but it is unknown what it is, if it is a deletion, it might be better to simply remove the bases.
It is very good you included your code, could you also provide a reproducible example, e.g. create a small dataset including the coverage file and a fasta file. If everyone can run your code with ease, then more people will run your example. The more self-contained the example (no downloading of files), the more people will simply copy-paste and debug for free.

(You do not open or close the output file, is it not necessary? It looks like you are overwriting a single record over the existing file in each iteration. But I am not a python expert, python files might be smarter than perl files ;) )

ADD REPLY • link 8.1 years ago by Michael 54k

0

Entering edit mode

Thanks Michael, I really appreciate the input!

Yes, I do mean deletion when I say drop out. It's not an IonTorrent term.
It is only relevant that it's a fasta file
I thought about that as well, but my supervisor wants 'N' in there so she can see the deletions easly. We were having issues with an out of the box solution over writing deletions with the reference file so when we made the SNP calls and exported it, it had a full genome. Which made us wonder about how many other things we pushed out incorrectly. Woot learning curves.
I think you may be onto something when you pointed out my loop rewriting itself. I will look into that as well.

Thank you again :)

ADD REPLY • link 8.1 years ago by skbrimer ▴ 740

0

Entering edit mode

8.1 years ago

Michael 54k

Disclaimer, I have not programmed any python ever in my life ....

... but: I think you are using Bio.SeqIO.write wrongly

From http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html

Use the function Bio.SeqIO.write(...), which takes a complete set of SeqRecord objects (either as a list, or an iterator), an output file handle (or in recent versions of Biopython an output filename as a string) and of course the file format:

from Bio import SeqIO
records = ...
SeqIO.write(records, "example.faa", "fasta")

Or, using a handle:

from Bio import SeqIO
records = ...
with open("example.faa", "w") as handle:
  SeqIO.write(records, handle, "fasta")

You are expected to call this function once (with all your records) and if using a handle, make sure you close it to flush the data to the hard disk.

So your relevant portion might look similar (sorry for any Perl style syntax you find);

records = []; #make an empty array
for seq_record in SeqIO.parse(ref,"fasta"):
        refSeq = seq_record.seq
        new_seq = "".join(filter_coverage(index,value) for index,value in enumerate(refSeq.tomutable(), start=1))
        new_rec = SeqRecord(Seq(new_seq),id="new_id",description="new_description")
        records.append(new_rec);

SeqIO.write(records, "NewSeq.fasta","fasta")

ADD COMMENT • link 8.1 years ago by Michael 54k

0

Entering edit mode

Thank you for your help Michael, but I am still unable to get the expected output. I will keep working on it. I'm sure it has something to due when calling the seq object, I must be doing it wrong somehow since it is not being edited. The filter_coverage function seems to be working when I make fake sequences and matching covergeDicts. It only seems to break down when I also include the Seq class. The id and description change, but not the sequence. Frustrating, but I will get it.

ADD REPLY • link 8.1 years ago by skbrimer ▴ 740

0

Entering edit mode

Maybe you should have a python crack look at this, I have a hard time figuring out what the line starting with new_seq =even does. Remember to post a small reproducible example, too.

ADD REPLY • link 8.1 years ago by Michael 54k

0

Entering edit mode

the new_seq= line joins all the letters together without spaces. It's the way python wants to write it now. It use to be new_seq.join() but the deprecated it and now they want people to use this way.

ADD REPLY • link 8.1 years ago by skbrimer ▴ 740

score 2 · Accepted Answer · 2016-03-17

Hi All,

It turns out the error in this scrip is right at the top. When I make the dictionary the index and value are encode as strings and not numbers so when all the other stuff runs, it runs without error and returns the original sequence because number do not match words.

Seriously. That took hours for me to figure out.

Lesson learned.

Please find the corrected script below if for some reason anyone else needs to do something like this.

All that is left to do with it is ti set up the argparser for the options I want so it will work with any file I pass.

from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio import SeqIO
import argparse 

## this makes a dict of the samtools depth coverage input file of all the 0's
coverageDict = {} ## first the empty dictionary

## this loops over the input depth information and appends the dictionary with 
## a key,value as genome position, depth of coverage
coverage = open("/home/sbrimer/Desktop/Columbia ST/1819_1/coverage.txt","r")
for line in coverage:
    coverages = line.strip().split('\t') #added strip, values also had newline character 
    coverageDict[int(coverages[1])] = int(coverages[2])
coverage.close()

## making a set of only the index of the missing
missing = {index for index,value in coverageDict.items() if value == 0}

def filter_coverage(index,value):
        if index in missing:
            return "N"
        else:
            return value 

## this should read in the sequence file as an oblect, change it to a mutable
## object, then anywhere the coverageDict value = 0. Replace that nt with "N"
append_handle = open("NewFile.fasta","a")
for seq_record in SeqIO.parse("Salmonella enterica subsp.fasta","fasta"):
    newSeq= "".join(filter_coverage(index,value)for index,value in enumerate(seq_record.seq, start= 1))
    SeqIO.write(SeqRecord(Seq(newSeq),id="something",description="something_else"),append_handle,"fasta")
append_handle.close()