DNA sequences trimming methods
6.9 years ago
l.souza ▴ 80

Hello,

This is my situation:

I have about 2000 DNA sequences to process, but I only want to work with their coding regions. I have the coordinates of all CDSs (predicted with Prodigal) in a file with this format:

DEFINITION  seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus - type SAT 1 isolate SAT1/NAM01/2010, complete 
genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1026..8045
                 /note="ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.537;conf=99.99;score=1639.83;cscore=1612.89;sscore=26.93;rscore=-13.40;uscore=34.66;tscore=5.68;"

How can I extract the sequences that correspond to these coordinates into a FASTA file?

dna trimming sequence cds • 1.8k views

You'll need to parse out the header and coordinate information from your file, then match to the headers in your fasta, and use the coordinates per header to cut each sequence.
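As a rough illustration of the matching and cutting step, here is a minimal Python sketch. It assumes the coordinates are 1-based and inclusive (as in GenBank-style locations) and that records are matched on the first word of the FASTA header. The names read_fasta and cut_region, and the toy data, are made up for this example.

```python
def read_fasta(lines):
    """Return {first word of header: sequence} from an iterable of FASTA lines."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Key each record by the first token of its header (e.g. the accession).
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def cut_region(seq, start, end):
    """Cut a 1-based, inclusive region out of seq."""
    return seq[start - 1:end]


# Toy data standing in for the real FASTA file and coordinate table.
fasta_lines = [">seqA toy record", "AAACCC", "GGGTTT"]
coords = {"seqA": (4, 9)}

genome = read_fasta(fasta_lines)
cds = {name: cut_region(genome[name], s, e) for name, (s, e) in coords.items()}
print(cds)  # {'seqA': 'CCCGGG'}
```

For real data you would pass an open file handle to read_fasta instead of a list, and genes on the reverse strand would also need to be reverse-complemented, which this sketch does not do.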

Can you post a few more lines of your file from Prodigal?

DEFINITION  seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus - type SAT 1 isolate SAT1/NAM01/2010, complete 
genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1026..8045
                 /note="ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.537;conf=99.99;score=1639.83;cscore=1612.89;sscore=26.93;rscore=-13.40;uscore=34.66;tscore=5.68;"

DEFINITION  seqnum=2;seqlen=8010;seqhdr="KR108948.1 Foot-and-mouth disease virus - type SAT 1 isolate KNP/196/91/1 polyprotein gene, partial 
cds";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1011..>8009
                 /note="ID=2_1;partial=01;start_type=ATG;rbs_motif=TTTA;rbs_spacer=14bp;gc_cont=0.537;conf=99.99;score=1624.62;cscore=1579.25;sscore=45.37;rscore=16.14;uscore=23.55;tscore=5.68;"

DEFINITION  seqnum=3;seqlen=8144;seqhdr="JF749860.1 Foot-and-mouth disease virus - type SAT 1 isolate KEN_004/2002, complete 
genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1018..8037
                 /note="ID=3_1;partial=00;start_type=ATG;rbs_motif=AAA;rbs_spacer=14bp;gc_cont=0.540;conf=99.99;score=1468.42;cscore=1472.82;sscore=-4.40;rscore=0.64;uscore=-10.72;tscore=5.68;"

DEFINITION  seqnum=4;seqlen=8156;seqhdr="KM268899.1 Foot-and-mouth disease virus - type SAT 1 isolate TAN/22/2012, complete 
genome";version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=53.37;transl_table=11;uses_sd=0
FEATURES             Location/Qualifiers
     CDS             1006..8025
                 /note="ID=4_1;partial=00;start_type=ATG;rbs_motif=TTTTA;rbs_spacer=14bp;gc_cont=0.537;conf=99.99;score=1462.32;cscore=1401.95;sscore=60.38;rscore=17.40;uscore=37.30;tscore=5.68;"

The file consists of repetitions like this...
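For records laid out like the ones above, the accession and coordinates can be pulled out with two regular expressions. This is only a sketch: it assumes the accession is the first word inside seqhdr="...", that partial genes are marked with < or >, and that reverse-strand genes appear as complement(start..end). The function name parse_prodigal_gbk is hypothetical, and the sample text is shortened from the records posted here.

```python
import re

# First token inside seqhdr="...", e.g. KU821590.1
HEADER_RE = re.compile(r'seqhdr="(\S+)')
# CDS locations such as 1026..8045, 1011..>8009 or complement(5..90)
CDS_RE = re.compile(r'^\s*CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?')


def parse_prodigal_gbk(lines):
    """Yield (accession, start, end, strand) for every CDS line."""
    accession = None
    for line in lines:
        if line.startswith("DEFINITION"):
            match = HEADER_RE.search(line)
            accession = match.group(1) if match else None
            continue
        match = CDS_RE.match(line)
        if match and accession is not None:
            strand = "-" if match.group(1) else "+"
            yield accession, int(match.group(2)), int(match.group(3)), strand


# Shortened sample; the real DEFINITION lines are longer and wrapped.
sample = '''DEFINITION  seqnum=1;seqlen=8075;seqhdr="KU821590.1 Foot-and-mouth disease virus"
FEATURES             Location/Qualifiers
     CDS             1026..8045
                 /note="ID=1_1;partial=00;start_type=ATG;"

DEFINITION  seqnum=2;seqlen=8010;seqhdr="KR108948.1 Foot-and-mouth disease virus"
FEATURES             Location/Qualifiers
     CDS             1011..>8009
                 /note="ID=2_1;partial=01;start_type=ATG;"
'''

cds_records = list(parse_prodigal_gbk(sample.splitlines()))
for record in cds_records:
    print(record)
```

The > in 1011..>8009 only flags the gene as partial at that end, so the sketch drops the marker and keeps the number.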


Is this GenBank format? You can convert it to BED (see some discussion here) and then extract the regions of interest with bedtools or BEDOPS.
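One detail to watch in the conversion: BED starts are 0-based and ends are exclusive, whereas the CDS locations are 1-based and inclusive, so only the start shifts by one. A minimal sketch of writing one BED line, where to_bed_line and the "cds" feature name are made up for this example:

```python
def to_bed_line(accession, start, end, strand, name="cds"):
    """Convert a 1-based inclusive location to a 6-column BED line."""
    # BED: 0-based start, exclusive end, so only the start is shifted.
    return "\t".join([accession, str(start - 1), str(end), name, "0", strand])


bed_line = to_bed_line("KU821590.1", 1026, 8045, "+")
print(bed_line)
```

A file of such lines could then be passed to something like bedtools getfasta -fi genomes.fna -bed cds.bed -fo cds.fna, where the file names are placeholders. The first column has to match the sequence names in the FASTA file.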


Not all of my sequences are in GenBank format!


Which output format did you choose for Prodigal? Do you have a mix of formats?

6.9 years ago
l.souza ▴ 80

I solved my problem by passing the '-d' option to Prodigal, which writes the nucleotide sequences of the predicted genes to a FASTA file.
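For anyone scripting this, the call might look roughly like the sketch below. The file names are placeholders and prodigal_command is a made-up helper. The flags -i (input), -o (output), -f (output format) and -d (nucleotide sequences of genes) are Prodigal options.

```python
import subprocess


def prodigal_command(genome_fasta, coords_out, cds_fasta):
    """Build a Prodigal call that also writes gene nucleotide sequences (-d)."""
    return [
        "prodigal",
        "-i", genome_fasta,  # input sequences
        "-o", coords_out,    # gene coordinates
        "-f", "gbk",         # GenBank-like output, as in the question
        "-d", cds_fasta,     # nucleotide sequences of the predicted genes
    ]


cmd = prodigal_command("genomes.fna", "genomes.gbk", "cds.fna")
print(" ".join(cmd))

# Uncomment to actually run it (requires prodigal on the PATH):
# subprocess.run(cmd, check=True)
```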
