Altering a python script to calculate GC content
1
0
Entering edit mode
3.2 years ago
dk0319 ▴ 70
#!/home/usr/anaconda3/bin/python3
import gzip
import sys
input_file, sequence_name, output_file = sys.argv[1:]
result = None
with gzip.open(input_file, "rt", encoding="utf-8") as file_in:
      for line_number, raw_line in enumerate(file_in):
            if raw_line.startswith(f">{sequence_name}")
                result = line_number
                break
if result is not None:
      with open(output_file, "a") as file_out:
            print(result, file=file_out)

I am interested in modifying this script so that when I input a sequence name that corresponds to a multisequence fasta file it will return the GC content as a ratio. I can easily do the math portion e.g G=input.count("G"), etc.

Where I am struggling is designating the actual sequence as my input. My fasta sequences are in this format

>PDBU01061878.1 Homo sapiens CAAPA_61878, whole genome shotgun sequence
ATGGAATGGATTCGAATGGAATGGAATAGATTGGAATCAAACAGAGTGGAATGGAATGCAATGAAATGGAATCGAAT

Any help would be appreciated

sequence • 1.0k views
ADD COMMENT
0
Entering edit mode
3.2 years ago
Mensur Dlakic ★ 27k

This script requires BioPython. It will print the name, length and GC content for each sequence. For your example sequence the output is:

PDBU01061878.1 77 0.3636

import os, sys
import pandas as pd
from Bio import SeqIO
from Bio.SeqUtils import GC

if os.access(sys.argv[1], os.R_OK):
    FastaFile = open(sys.argv[1], 'rU')
else:
    print('\n !!! Input file "%s" does not exist in this directory !!!\n' %
          sys.argv[1])
    sys.exit(1)

for rec in SeqIO.parse(FastaFile, 'fasta'):
    name = rec.id
    seq = rec.seq
    seqLen = len(rec)
    seqGC = GC(seq)
    print('%s %d %.4f' % (name, seqLen, seqGC/100.0))

FastaFile.close()
ADD COMMENT

Login before adding your answer.

Traffic: 3321 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6