Question

bedtools getfasta (how to get only cds coding sequence)

1

Entering edit mode

6.0 years ago

benjamin ▴ 10

Hi all, I'm actually using bedtools to extracte my cds in fasta format from a gff3 file. The command is : bedtools getfasta -fi scaf0_test.fa -bed run_augustus_0035.gff.out -fo fasta.out -s

But the problem is that it gives me not only the coding part (without stop codon inside but also the non-coding part)

For exemple, here is my protein sequence from de gff file:

MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVSSVL
RAILQDSDKPESEHSGHACEQCGKLYKTVRTLYSHIMLKHPKDKEEVQCSVCDKKYSSALSLQKHVRYMHRYEHRCK
TCYRTFATSEMLVCHRESCYNNVSPCPVCNKIFDSRLALRNHINYNHPRSEESSVQERRQCNVCSRMFTSSRSLLNHM
AAIHPVGTTDCNLCGRTFNSMPAFRSHFLYKHGDHGVHCTKCHKLFATDTSLRRHMEKVHGKNNKPGFLCGICQTYFYK
SSDLVQHILANHKETSSD

And here is my dna sequence comming from the outfile (traducted):

   MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTVR*YVTNTG
   VDSEEIKDRSCKHLQYRAPLYARA*FALIRHCADATTLPNITAFGIVTLPV**YKIN*THYCGVICAQS*TALHVMTC*RDECFRFQI
    RRSLTLITMVVLIIMRTPRTIYT*TNPRNHILSP*SLRCFLTYIHVYKIYKNRSRTIRRGQVRSQYIECYIYTL*HPCLQSLSNCMRQ
    YLCYRSQKQVCIFIEDPACLSETAIFRIRCKRKRHIFTQIFTSNLLNLPTLKFQRNKYPFKFIYF**ISTNLFNCKLSTKISCKKKEIF
    FSSSSFFFSTDIYQLRRVCTYITENETRKDVNIINTLQNEC*NRYC*YRSKHSSVGN*YR*GTFAFFQFRIARHSTRLRQTGERA
    FRARL*AMRQTL*NSEDALFAYHAETPQR*RGGAVFRLR*KVLVGTESTETRALHAPL*APM*NVLQDFCNV*NASLSQGKLLQ
    QRVPMPRVQQDI*LPTCTSESYKL*SSKK*RKFGSRKEAMQRL*PHVY*FS*PPQPHGGYTPGRHD*LQFVRPHF*FHASLSKP
    FPLQAR*PRRTLHKVS*IIRNRYKSSPAHGESTRQK**TRFLVRYMPNLFL*IE*PGAAHIGKP*GDIE*L

As you can see, there are a lot of stop codon, do you know why I do not get the exact same protein sequence? The fasta dna sequence should contain two cds concatened (one with 74 bases and the other should have 256), then the total length of this sequence should be 330.

Here is an exemple of my gff file:

# gffread v0.9.9
##gff-version 3
scaffold_0  AUGUSTUS    mRNA    42655   44668   .   -   .   ID=g1.t1;geneID=g1
scaffold_0  AUGUSTUS    CDS 42655   43423   0.94    -   1   Parent=g1.t1
scaffold_0  AUGUSTUS    CDS 44445   44668   0.82    -   0   Parent=g1.t1
scaffold_0  AUGUSTUS    mRNA    51102   55274   .   +   .   ID=g2.t1;geneID=g2
scaffold_0  AUGUSTUS    CDS 51102   51192   0.60    +   0   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 51310   51528   0.80    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 52816   52845   0.64    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53114   53223   0.50    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53333   53633   0.91    +   0   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 53981   54296   0.64    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 54559   54581   0.94    +   1   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 54783   54975   0.89    +   2   Parent=g2.t1
scaffold_0  AUGUSTUS    CDS 55184   55274   0.66    +   1   Parent=g2.t1

mac gff fasta • 5.2k views

ADD COMMENT • link 6.0 years ago by benjamin ▴ 10

0

Entering edit mode

Have you tried gffread - part of gff utilities (http://ccb.jhu.edu/software/stringtie/gff.shtml)?

ADD REPLY • link 6.0 years ago by jean.elbers ★ 1.7k

0

Entering edit mode

Hi, thanks for you help. How could i run bedtools geftasta with a cds entrie? I cannot find this argument on the commande line?

ADD REPLY • link 6.0 years ago by benjamin ▴ 10

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

Ok, it is done, thank you.

ADD REPLY • link 6.0 years ago by benjamin ▴ 10

score 2 · Answer 1 · 2018-04-20

2

Entering edit mode

6.0 years ago

Carlo Yague 8.7k

I think that bedtools getfasta will not parse your gff3 file based on [cds -> mRNA] relationship. Therefore you get either the sequence cds by cds, or from start to end of the mRNA (including introns). The frame shift caused by the introns would explain the number of stop codons you find.

This would make sense based on the example you provide:

Length of g1.t1 first intron = 44668-44445 = 223
number of a.a encode in the first intron = 223/3 ~= 74
number of consequently matching a.a bewteen theoretical sequence and the outfile = 75 a.a
( MIAALVPREPPVGPPGRRAEGRGGIHCDATNNHTYTSPHIHSRHVHAVADRMSTLGKNLIGAPWRGVTTRGTTV )

A solution would be to run bedtools getfasta your cds entries only, then to merge them by ID.

ADD COMMENT • link 6.0 years ago by Carlo Yague 8.7k

0

Entering edit mode

Hi, thanks for you help. How could i run bedtools geftasta with a cds entrie? I cannot find this argument on the commande line?

ADD REPLY • link 6.0 years ago by benjamin ▴ 10

0

Entering edit mode

As @Carlo said you will have to get CDS entries individually and then merge them for individual ID's. But the easier option would likely be using the gffread utility that was posted above by @jean.elbers.

ADD REPLY • link 6.0 years ago by GenoMax 141k

0

Entering edit mode

just create the file yourself. Something like:

awk '$3 == "CDS" {print $0}'  run_augustus_0035.gff.out > run_augustus_0035_cds_only.gff

ADD REPLY • link 6.0 years ago by Carlo Yague 8.7k

0

Entering edit mode

Thanks you, it works now but now I have another issue beacause it works well for my sequences starting by ATG (0 reading frame), but when I have a CDS which has a reading frame 1 or 2, my sequences are not in the good reading frame and I have to delete 2 or 1 base for each sequence. Does someone have an idea how I could from my gff file, see the readind frame and if it is 1, delete one base or 2 delete 2 bases to get the correct amino acide sequence when I translate the dna sequence?

ADD REPLY • link 6.0 years ago by benjamin ▴ 10

0

Entering edit mode

You need to merge the DNA sequence of your cds before translating it, it will fix the reading phase problem.

ADD REPLY • link 6.0 years ago by Carlo Yague 8.7k