Question

Is it possible to read a colon-delimited description from a FASTA file?

0

9.0 years ago

bfeeny ▴ 50

I am working with human_g1k_v37.fasta which is found on the 1000genomes site, specifically: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

I have parsed this into files of individual chromosomes.

The header for chromosome 1 looks like so:

>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

So I have an index of 1, with the rest of the data as part of the description.

My understanding is that this header can be explained as follows:

coord_system_name = chromosome
coord_system_version = GRCh37
seq_region.name = 1
seq_region.start = 1
seq_region.length = 249250621
seq_region.strand = 1

My question is: is there anything in Biopython that can read these values in? I am just identifying the file I am reading as a file of type "fasta". I am wondering whether I must parse this manually by splitting on colons, or whether functions already exist in Biopython that can do this for me.

Here is an example of the code I use to read in this file:

from Bio import SeqIO

def read_fasta_file(filename, file_format="fasta"):
    # Plain text mode is enough; the "rU" mode was removed in Python 3.11.
    with open(filename) as handle:
        for record in SeqIO.parse(handle, file_format):
            print("ID %s" % record.id)
            print("Sequence length %i" % len(record))
            print("Sequence desc %s" % record.description)
            # record.seq.alphabet is not printed: alphabets were removed in Biopython 1.78.

1000genomes biopython fasta • 2.6k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.0 years ago by bfeeny ▴ 50

Ram · Accepted Answer · 2015-04-28

1

9.0 years ago

Devon Ryan 104k

The question is what you actually want to get from the description. There's no standard format for it; it can be free text (though it's structured in this case), so it won't be automagically parsed.

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.0 years ago by Devon Ryan 104k

0

Mainly to ensure it's GRCh37, and to see which chromosome it is. I have seen multiple file types store "annotations", sort of like key/value pairs. I figured there might be some tools in Biopython to help me get this information from any FASTA file.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by bfeeny ▴ 50

0

You'll have to just parse the description with a regex. If you read through the README file that describes what you downloaded, you'll see that it's mostly GRCh37, with the MT sequence changed.
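A minimal sketch of that regex approach, assuming the description always ends with the six colon-delimited fields shown in the question. The function name `parse_description` and the dictionary keys are hypothetical labels, and the fifth field is called `length` only because that is how the question interprets it.

```python
import re

# Matches the trailing "chromosome:GRCh37:1:1:249250621:1" part of the description.
# Assumption: six colon-delimited fields, the last three numeric.
HEADER_RE = re.compile(
    r"(?P<coord_system_name>[^:\s]+):(?P<coord_system_version>[^:\s]+)"
    r":(?P<name>[^:\s]+):(?P<start>\d+):(?P<length>\d+):(?P<strand>-?\d+)\s*$"
)

def parse_description(description):
    """Return the fields as a dict, or None if the description has another layout."""
    match = HEADER_RE.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("start", "length", "strand"):
        fields[key] = int(fields[key])
    return fields

# Biopython's record.description holds the full header line without the ">".
fields = parse_description("1 dna:chromosome chromosome:GRCh37:1:1:249250621:1")
free_text = parse_description("1 some free text")
```

Returning `None` for a description that does not match keeps the caller honest about headers that are free text, which is the common case.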

ADD REPLY • link 9.0 years ago by Devon Ryan 104k

0

Thanks. I want my program to be able to read a FASTA file and identify which chromosome(s) are in it, so it can do proper sequencing to the reference chromosome(s). Obviously my reference chromosome has this colon-delimited header. Would a typical FASTA header, if there is such a thing, have a fairly standard way to identify which chromosome is being passed in? I guess I could assume that anything my program uses is human, read only the index, and ignore the rest of the header. Does that sound right?

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by bfeeny ▴ 50

1

The chromosome name is what follows the ">", so chromosome 1 in your case. The remainder of what you showed is typically not present. Don't expect anything other than a chromosome name. There is no general way to tell from a fasta file what organism it came from or what version it is.
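As a rough illustration of relying only on the name after the ">", the sketch below collects the first whitespace-delimited token of each header line. It uses plain Python rather than Biopython (where the same token is available as `record.id`); the function name `chromosome_names` and the sample data are hypothetical.

```python
import io

def chromosome_names(handle):
    """Yield the first whitespace-delimited token of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split(None, 1)
            if tokens:  # skip a bare ">" with no name
                yield tokens[0]

# Tiny in-memory example: one header with a description, one without.
example = io.StringIO(
    ">1 dna:chromosome chromosome:GRCh37:1:1:249250621:1\nACGT\n>MT\nGATC\n"
)
names = list(chromosome_names(example))
```

Anything after the first token is ignored, so the same code works whether or not the file carries a structured description.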

ADD REPLY • link 9.0 years ago by Devon Ryan 104k

0

Devon, if you post your basic reply as an answer, I will mark it as accepted. Thank you for your help.