GFA to Fasta file
2
6
Entering edit mode
8.4 years ago

Hey guys,

Does anyone have a script (or a awk line) to extract fasta reads from a GFA file?

I've run this miniasm in some Pacbio reads I have, and I would like to extract the final unitigs from the final file I have (which is a GFA file) in the fasta format!

Thanks a lot guys!

sequence • 16k views
ADD COMMENT
1
Entering edit mode

Awesome!!

Thank you so much!

ADD REPLY
0
Entering edit mode

Hi,

could you describe what does GFA stand for? I don't know about this format.

ADD REPLY
0
Entering edit mode

GFA is the Graphical Fragment Assembly format. Here is a post by Heng Li (@lh3) on this: https://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format.

ADD REPLY
13
Entering edit mode
8.4 years ago
lh3 33k
awk '/^S/{print ">"$2"\n"$3}' in.gfa | fold > out.fa
ADD COMMENT
1
Entering edit mode

First, I'm sorry for posting a comment here after so many years!

I just started playing with PacBio reads though, and it's the first time I came across the gfa format... So it appears that all you need is the "S" lines, but I was wondering whether this is entirely true. I mean there are those "L" lines as well, linking certain segments together. Shouldn't we take into account these links and produce a fasta file of linked segments?

ADD REPLY
0
Entering edit mode

Usually the links describe ambiguous points in the graph where the assembler has not been able to decide on the correct path between 2 or more segments.

If you want to join contigs/segments together before exporting to fasta you should use a scaffolding tool such as:

  • GraphUnzip (Detangles graph, can be used before scaffolding)
  • DENTIST
  • SAMBA
  • ntLINK
  • SLR
  • SGTK

Those use long reads (some also use Hi-C). Other good scaffolders are available that only use Hi-C data.

ADD REPLY
0
Entering edit mode
3 months ago
Adam Taranto ▴ 40

Slightly modifying Heng's answer: If your sequence name is longer than 80 chars using fold will wrap the header onto a new line causing part of the header to be incorrectly read as sequence.

To avoid this we can first print the fasta header line and then separately wrap the sequence lines with fold.

Support long sequence names

awk '/^S/{print ">"$2; printf "%s", $3 | "fold -w 80"; close("fold -w 80"); print ""}' in.gfa > out.fa

Include GFA tags in fasta header line

If you also want to preserve the gfa segment tags (i.e. segment length LN, or depth DP) in the fasta description you can append them to the headers:

awk '/^S/{header=">"$2; for(i=4; i<=NF; i++) {header=header" "$i}; print header; printf "%s", $3 | "fold -w 80"; close("fold -w 80"); print ""}' in.gfa > out.fa

Note: close() is required to prevent the headers and sequences from getting out of sync or being skipped when sequences are very long.

ADD COMMENT

Login before adding your answer.

Traffic: 1709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6