Question

Beginner in Python needs assistance with counting 'A's in DNA sequences

1

Entering edit mode

7.4 years ago

oki4 ▴ 10

Objective: Write a program that counts the number of A's in a DNA sequence. The input is one sequence in FASTA format in a file called 'dna.txt'.

My input file:

>Human
AGCATGCATCGATCGATCGACTAGCTAGCG
>Chimp
GATATGTCGAGATCGTCAGCTCGATCAGCT
>Gorilla
TGTGTCGATCTCGAGCTGAGTCGTCTATCA

Output: I don't need to create only print output info

My code so far:

DNA_list = []
with open('dna.fasta.py') as f_in:
    lines = f_in.read()
    if lines[0] == 'A' or 'C' or 'G' or 'T':
        DNA_list = DNA_list.append(lines)
    else:
        pass

What I would like to do and Issue:

After creating the list of DNA sequences/strings , for each sequence, I would like would like to count the 'A's using the count method and then I would just print to IDLE. The problem is that IDLE is saying that there is 'None' in the DNA_list. I don't know why the sequences were not appended to my list.

I have no idea what I'm doing wrong and would like to know if I'm heading in the right direction and how to fix my program.

Thanks.

python • 9.6k views

ADD COMMENT • link updated 7.4 years ago by WouterDeCoster 47k • written 7.4 years ago by oki4 ▴ 10

0

Entering edit mode

You're not iterating over the file. Add

for line in lines:
    if line[0] .. etc.

and retry.

ADD REPLY • link 7.4 years ago by Zaag ▴ 860

score 1 · Answer 1 · 2016-12-05

1

Entering edit mode

7.4 years ago

valerie ▴ 100

I would have done this:

DNA_list = []
f_in = open('dna.txt', 'r');
for line in f_in:
    if line[0] != '>':
        DNA_list = DNA_list.append(line)
number_of_a = DNA_list.count('A')
print(number_of_a)

ADD COMMENT • link 7.4 years ago by valerie ▴ 100

2

Entering edit mode

That code does not scale well. There is no reason to append anything; just count the number of As in each line and increment a counter, instead.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

1

Entering edit mode

This is a very unpythonic line: DNA_list = DNA_list.append(line) I very much doubt if this code will even work, because you are assigning the result of append?

And indeed, as written by Brian, there is no need to keep all sequence into memory. You should add to a counter while looping over the file.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

score 1 · Answer 2 · 2016-12-05

1

Entering edit mode

7.4 years ago

WouterDeCoster 47k

Try to use biopython for jobs involving sequences. As such you avoid dangerous assumptions about the structure of your file, and you are quite sure that you are using the most optimal method. A great resource is the biopython cookbook

I would use the following solution, you could save this code as A_counter.py (for example) and execute it in a terminal as python A_counter.py yourfile.fasta

import sys
from Bio import SeqIO

counter = 0
for record in SeqIO.parse(sys.argv[1]. "fasta")
    counter += record.seq.count('A')

You could even write this as:

counts = sum([record.seq.count('A') for record in SeqIO.parse(sys.argv[1]. "fasta")])

but don't worry about lovely constructions like this when you just started with python. It's called a list comprehension and it's awesome.

Some further remarks/feedback on your code:

DNA_list = []
with open('dna.fasta.py') as f_in:
    lines = f_in.read()
    if lines[0] == 'A' or 'C' or 'G' or 'T':
        DNA_list = DNA_list.append(lines)
    else:
        pass

f_in.read() just reads the first line of the file, you are not looping over the file. You won't get all records. That's probably why you got 'None'. The following would have worked:

DNA_list = []
with open('dna.fasta.py') as f_in:
    for line in f_in.read()
        if line[0] == 'A' or 'C' or 'G' or 'T':
             DNA_list = DNA_list.append(lines)

But note that a sequence could also start with an 'N'. In addition, you are now storing all lines in a list. That may work fine for a small file, but will get problematic for large files because every line has to fit in memory. Instead, I loop over the file and only keep track of the A-counts. Finally, there is no need for else: pass, you can just leave it. If the if condition isn't True, the if-block will not run.

ADD COMMENT • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks when I run this code, the one you wrote up above, I get an error message Code: DNA_list = [] with open('dna.fasta.py') as f_in: for line in f_in.read(): if line[0] == 'A' or 'C' or 'G' or 'T': DNA_list = DNA_list.append(line)

Error message: Traceback (most recent call last): File "/Users/Documents/bnfo135/lab_08.py", line 216, in <module> DNA_list = DNA_list.append(line) AttributeError: 'NoneType' object has no attribute 'append'

Does that mean I can't append string to the list? I thought that was allowed in python.

ADD REPLY • link 7.4 years ago by oki4 ▴ 10

0

Entering edit mode

Also we have not learned sys.argv yet so I can't use that method. Thanks

ADD REPLY • link 7.4 years ago by oki4 ▴ 10

0

Entering edit mode

You can replace the sys.argv[1] with the name of your file.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

This particular issue was pointed out by others. The problem is the following line:

DNA_list = DNA_list.append(lines)

The .append() part modifies DNA_list in place and nothing is returned. You're then setting that nothing that's returned to DNA_list, resulting in it being None. Since None isn't a list, you can't then append to it. Change the line to:

DNA_list.append(lines)

Of course it'd be much more efficient to not use lists at all and instead count the bases in each line.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

DNA_list = DNA_list.append(line) is part of your code, not mine. For the explanation of the error, see the comment of Devon Ryan

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

score 0 · Answer 3 · 2016-12-05

0

Entering edit mode

7.4 years ago

Devon Ryan 104k

Hint: with will never terminate, don't use it. Use a different flow control structure instead (e.g., for).

Also, biopython is your friend.

ADD COMMENT • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

You mean the with construction isn't a good method to open files?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

1

Entering edit mode

No, since this appears to be a homework problem I'm trying to not explicitly give the answer. The initial problem was incorrectly using with (there are others obviously).

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

Right. Well given that it's essentially a one-liner it's hard not to explicitly give the answer :( I failed.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

score 0 · Answer 4 · 2016-12-05

0

Entering edit mode

7.4 years ago

vmicrobio ▴ 290

with this code, you may anticipe the calculation of other counts

nbA = nbC = nbG = nbT = nbTotal = 0

for nucleotide in adn:
    if nucleotide == 'A':
        nbA += 1
    elif nucleotide == 'C':
        nbC += 1
    elif nucleotide == 'G':
        nbG += 1
    elif nucleotide == 'T':
        nbT += 1
    nbTotal += 1
print("A = ", nbA)

ADD COMMENT • link 7.4 years ago by vmicrobio ▴ 290

1

Entering edit mode

You could also rewrite this one using a dictionary:

countDict ={'A': 0, 'T': 0, 'G': 0, 'C': 0}
for record in SeqIO.parse("yourfasta.fa", "fasta"):
    for nucleotide in record.seq:
        countDict[nucleotide] += 1

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

You do not have to use all the elseifs, you can just use adn.count('A') to count all the A in adn string. You will have only 4 rows in your code instead of this long code to count the number of all the nucleotides.

ADD REPLY • link 7.4 years ago by valerie ▴ 100

0

Entering edit mode

Thanks what does adn stand for?

ADD REPLY • link 7.4 years ago by oki4 ▴ 10

0

Entering edit mode

'adn' stands for the input sequence file (you have to define it)

ADD REPLY • link 7.4 years ago by vmicrobio ▴ 290