How to get fasta sequences of features from a GFF3 file?
3
1
Entering edit mode
6.7 years ago
Lina F ▴ 200

Hi all,

I am interested in pulling out "CDS" fasta sequences from a GFF3 file. My GFF3 file lists one or more fasta sequences at the end.

I looked into using the gffutils python package but according to this documentation, it does not deal with fasta sequences listed at the end of the GFF3 file.

Am I stuck writing my own parser, or is there a better solution?

Thanks for any suggestions!

EDITED to add a snippet of my GFF file:

##gff-version 3
##sequence-region Consensus_10_consensus_sequence_1 1 645439
Consensus_10_consensus_sequence_1   feature gene    551 2104    .   +   .   name=gene
Consensus_10_consensus_sequence_1   feature CDS 551 2104    .   +   0   name=YALI0C06512p CDS
...
Consensus_9_consensus_sequence_4    feature mRNA    9625    9891    .   +   .   name=mRNA
Consensus_9_consensus_sequence_4    feature CDS 9625    9891    .   +   0   name=YALI0A21351p CDS
##FASTA
>Consensus_10_consensus_sequence_1 <unknown description>
TGCCACCTCCAAATTAACTCTCGCTTATTTCTTGTACCTGTCATATCACGTGATGTAGCT
TCCCAATCAAGAGCGGATCCTGCCTGTTTGGCTGCGTGGGTTTGCGTCTTCTTTCCGTTT
GAAGCAGTGGTATTATTCCCCCATTGTGCCAAAAGTAATGCTGAAAAGATGCCCACGAAT
>Consensus_10_consensus_sequence_2 <unknown description>
CCGAAACCACAGCCATGATCAGAGTCACTCCTATTCCAACCCCGCCAACTGCCGTGGGGT
ACACGTTATATTGGGTAGGTGTGTATCCTTGAGACTTGAGCCAGGAGATCATGGAAGGTT
GGGTGCTGGCAATACAGGTGTTATTGTAGCAAAGAAAGATGAGAGAGAAGAAATAAATGT
GCCAGGTTTTCAAAGAGCGTTTCAAAACTTGCAAGAACGGCTCTTCTTTGTTACCGGTAT
CGCCGATTTGTGCTCTTCTCTCCTTGGCTATAGCAATGTCTTCCTTGGTGAAGTACCAGC
TGGTAGTTGTCTCCGGTGTGTTTGGGTTGACGAACATGGTGTAAAGTGCCACTGGAAATG
>Consensus_10_consensus_sequence_2 <unknown description>
CACTCTCAAGCATTTAGGAACTTGTCAAGAGGTTCAAAGGTTGGAACTTGCAACTGAACT
GATCGCAACATAATCACCGCTATTAAACCCTGATTACAAGTGCTTCTGATTGCATCCACA
GTTCATTTCCATGGGCTAGGCTATACGAAAATACAAGGATTAGAAACTATATACAATTGA
CTCTGCAATCTTTCCCGCTAAACGGTGGTGTGGTTATGACCTGGCTCGTGTTCATGGCCG
...
gff3 fasta parser • 8.0k views
ADD COMMENT
0
Entering edit mode

Please post a snippet of your GFF3 file.

ADD REPLY
5
Entering edit mode
6.7 years ago

You should just split your file into two separate files, a GFF and FASTA. Then you can do:

ADD COMMENT
0
Entering edit mode

this approach seemed easiest -- thanks!

ADD REPLY
0
Entering edit mode

How would you translate it to the amino acid sequence?

ADD REPLY
0
Entering edit mode

Biopython? They have plethora of methods mimicking the central dogma essentially

ADD REPLY
0
Entering edit mode
6.7 years ago
prasundutta87 ▴ 660

As you have the coordinates of your sequence of interest, you can use samtools faidx.

You first have to make an index of your reference genome using samtools faidx itself after which you can then use the same command to extract your sequence of interest..the usage example available from the command line of samtools faidx is helpful..

ADD COMMENT
0
Entering edit mode

If the FASTA sequence is already in the attributes field, it may be a simple parsing step to excise that info.

ADD REPLY
0
Entering edit mode
6.7 years ago

bedtools getfasta can do that

use -s to force strandness

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

bedtools getfasta -fi yourgenome.fa -fo out.fa -bed yourgff.bed -s

ADD COMMENT

Login before adding your answer.

Traffic: 1560 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6