Question

Taking information from a .txt and converting it to fasta

0

Entering edit mode

5.2 years ago

brett.plemons • 0

Okay so i have a .txt file that contains a DNA sequence. i want to remove that sequence from the file and create a .fasta.

i know how to do it using python if it werent for the fact that the first 8 lines in the .txt are as follows;

LOCUS: SCU49845
ACCESSION: U49845
ORGANISM: Saccharomyces cerevisiae (baker's yeast)          
AUTHORS: Roemer,T., Madden,K., Chang,J. and Snyder,M.
TITLE: Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein
JOURNAL: Genes Dev. 10 (7), 777-793 (1996)
PUBMED: 8846915
SOURCE: https://www.ncbi.nlm.nih.gov/nuccore/U49845.1?report=genbank&to=5028
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA

there are quite a few more lines of the sequence obviously but this is how it is formatted. it looks like genbank but it is in a .txt i am not sure how i can do that conversion either.

I was able to write the code to take every line in the txt document and make them individual items in the list, but i cant just use a function that removes the first 8 lines because the code needs to be usable in other documents that may differ from this one. So how do i just search for the nucleotide sequences?

edit: didnt realize the formatting reverted. The file i am using is *.txt

python • 3.9k views

ADD COMMENT • link updated 5.2 years ago by patelk26 ▴ 290 • written 5.2 years ago by brett.plemons • 0

1

Entering edit mode

Why not just parse the ACCESSION lines, get the accession and download the sequences in FASTA format using Entrez Direct? Do you expect these sequences to be somehow different from the versions that are public?

ADD REPLY • link 5.2 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

Can you update your post to include the format of the sequences in the context of the header too?

ADD REPLY • link 5.2 years ago by Joe 21k

0

Entering edit mode

That looks like a part of GenBank format file, which is text based. Only some of the lines appear to have been retained.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

realistically, if there was a way to just remove the header without having to convert the file, that would also work. However, since it needs to be usable for multiple possible documents i cannot use like range(1,9)

ADD REPLY • link 5.2 years ago by brett.plemons • 0

0

Entering edit mode

You could perhaps check to see if lines have [ACTGN] and nothing else. Retain those/remove others.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

How about a customisable parameter for how many header lines to skip? Or are you hoping that the script can ‘figure this out for itself’?

ADD REPLY • link 5.2 years ago by Joe 21k

0

Entering edit mode

yeah i am hoping the script can figure that out on its own.

the other thing i came up with is

import pandas as pd
dna = pd.read_csv(sep = "\n", header = None, names = ['code'])
regex = "[^ATCG]+\\b"
filter = dna['code'].str.contains(regex)
dna = dna[~filter]

but i keep getting an error saying that

parser_f() missing 1 required positional argument: 'filepath_or_buffer'

for the dna = pd.read_csv(sep = "\n", header = None, names = ['code']) line and i am not sure what it is talking about.

ADD REPLY • link 5.2 years ago by brett.plemons • 0

0

Entering edit mode

because i didnt add my file path haha, but now im getting an unkown error for invalid syntax

ADD REPLY • link 5.2 years ago by brett.plemons • 0

4

Entering edit mode

5.2 years ago

finswimmer 16k

Not python, but maybe awk is also good.

The logic behind this is, that the first lines have fields delimited by :. So there are at least 2 fields. The sequence itself have no :, so the field number is 1.

$ awk -v FS=":" '$1=="ACCESSION" {gsub(" ", "", $2); print ">"$2} NF==1 {print}' a.txt
>U49845
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA

The gsub is used for removing white spaces.

fin swimmer

EDIT:

Here the same logic in python:

import sys

with open(sys.argv[1], "r") as genefile:
    for line in genefile:
        fields = line.split(":")
        if(fields[0] == "ACCESSION"):
            print(">"+fields[1].strip())
        if(len(fields) == 1):
            print(fields[0].strip())

Run it with python extract.py a.txt

ADD COMMENT • link 5.2 years ago by finswimmer 16k

0

Entering edit mode

that's a great use case for awk!

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Entering edit mode

5.2 years ago

patelk26 ▴ 290

I am sure there are more efficient and concise way of doing this, but this gets you the desired result.

#!/usr/bin/Python

# file to write the results
fasta = open("resultingFasta.fa","a")

# list to capture results
result = []

with open("dna.txt") as fh:
    for line in fh:
        if(line.startswith(("LOCUS","ORGANISM","AUTHORS","TITLE","JOURNAL","PUBMED","SOURCE")) ):
               # skip for lines starting with these
               continue 

        elif(line.startswith("ACCESSION")):
              # capture accession ID
              line = line.rstrip() # skip any trailing characters
              acc = line.split(": ")    # splitting ACCESSION: U49845
              result.append(">"+acc[1])# append accession ID to a list

        else:
              # the line with nucleotide sequence
              line = line.rstrip()
              result.append(line)   # append result to a list


for x in result:
    fasta.write(x+"\n") # writing it out to a file

ADD COMMENT • link 5.2 years ago by patelk26 ▴ 290

score 1 · Accepted Answer · 2019-02-22

Well i figured it out using regular expressions

import re
with open("dna_sequence.txt") as fh:
    mydna = [line for line in fh if re.match(r'^[AGCT]+$', line)]
    #print(mydna)
complement = ""
for i in mydna:
    for x in i:
        if x == 'A':
            complement += 'T'
        elif x == 'G':
            complement += 'C'
        elif x == 'C':
            complement += 'G'
        elif x == 'T':
            complement += 'A'
print(complement)