Altering a python script to calculate GC content

0

Entering edit mode

3.2 years ago

dk0319 ▴ 70

#!/home/usr/anaconda3/bin/python3
import gzip
import sys
input_file, sequence_name, output_file = sys.argv[1:]
result = None
with gzip.open(input_file, "rt", encoding="utf-8") as file_in:
      for line_number, raw_line in enumerate(file_in):
            if raw_line.startswith(f">{sequence_name}")
                result = line_number
                break
if result is not None:
      with open(output_file, "a") as file_out:
            print(result, file=file_out)

I am interested in modifying this script so that when I input a sequence name that corresponds to a multisequence fasta file it will return the GC content as a ratio. I can easily do the math portion e.g G=input.count("G"), etc.

Where I am struggling is designating the actual sequence as my input. My fasta sequences are in this format

PDBU01061878.1 Homo sapiens CAAPA_61878, whole genome shotgun sequence
ATGGAATGGATTCGAATGGAATGGAATAGATTGGAATCAAACAGAGTGGAATGGAATGCAATGAAATGGAATCGAAT

Any help would be appreciated

sequence • 465 views

ADD COMMENT • link 3.2 years ago by dk0319 ▴ 70

Login before adding your answer.