Question

Orthomcl clusters to fasta

0

Entering edit mode

5.6 years ago

wocana ▴ 20

Any idea how to extract orthomcl Clusters to fasta?

I got 540 clusters i just want some of them on fasta.

The final output results like this:

cluster000: Ratn|NC_000067.6:7605211..7607094 Ratn|NC_000076.6:46984187..46986072 Ratn|NC_000072.6:126135324..126137210 Ratn|NC_000071.6:43059052..43060937 Ratn|NC_000069.6:92913971..92915859 Ratn|NC_000071.6:137999400..138001284 Ratn|NC_000084.6:3880095..3881981 Ratn|NC_000073.6:72629984..72631867 Ratn|NC_000073.6:14635959..14637845 Ratn|NC_000076.6:121566110..121567992 Ratn|NC_000086.7:51987005..51988889 Ratn|NC_000083.6:9587574..9589459 Ratn|NC_000073.6:82995991..82997877 Ratn|NC_000074.6:122158254..122160140 Ratn|NC_000074.6:4455815..4457701 Ratn|NC_000072.6:32035055..32036940

ORTHO MCL FASTA alignment CLUSTER • 2.0k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 5.6 years ago by wocana ▴ 20

1

Entering edit mode

There is no need to SHOUT. I have removed the uppercase characters from your title.

ADD REPLY • link 5.6 years ago by WouterDeCoster 47k

0

Entering edit mode

Is that representation for one cluster cluster000? So you basically want to get the intervals represented there in one fasta sequence?

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Entering edit mode

The cluster im interest of has 260 fasta IDs and i wanna extract them from the fasta file, first i organaized them 1 ID per line as you can see below "list.txt". Im using that script i modified it but its not working at all

ADD REPLY • link 5.6 years ago by wocana ▴ 20

0

Entering edit mode

im using this script but im just getting 136/260 i dont know why.. my list.txt look like this

CoGr|NW_001584579.1:68053..71823
CoGr|NC_008804.1:48922615..48926373
CoGr|NC_008801.1:545795465..545799243 
CoGr|NC_008804.1:161966386..161970130 
CoGr|NC_008807.1:226381677..226385438 
CoGr|NC_008805.1:191827747..191831534
CoGr|NC_008802.1:270867733..270871532 
CoGr|NC_008801.1:240232298..240236072
CoGr|NC_008803.1:202573007..202576775

SCRIPT

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);

ADD REPLY • link 5.6 years ago by wocana ▴ 20

0

Entering edit mode

Try using Bioperl SeqIO and Bio::DB::Fasta, so you don't have to re-invent the wheels handling FASTA files...

ADD REPLY • link 5.6 years ago by Vitis ★ 2.5k

GenoMax · Answer 1 · 2018-09-20

1

Entering edit mode

5.6 years ago

lieven.sterck 15k

Since this is an orthomcl output, you should have a blast database of the proteins you clustered. So select the orhto fam you want, parse out the proteinsIDs and get those from the blastDB.

Get the list ids you need:

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

get those ids from the blastdb and write to file

blastdbcmd -db <your protein blastDB> -entry_batch ids2get.lst > proteins.fasta

ADD COMMENT • link 5.6 years ago by lieven.sterck 15k

0

Entering edit mode

Didnt work

Error: [blastdbcmd] CoGr|NW_001583259.1:530924..534690: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:39070112..39072248: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:104399108..104402881: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:127199820..127202817: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:130223934..130226919: OID not found
Error: [blastdbcmd] CoGr|NC_008801.1:120801884..120804770: OID not found
Error: [blastdbcmd] CoGr|NC_008805.1:235446572..235450134: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:315125658..315129411: OID not found
Error: [blastdbcmd] CoGr|NC_008804.1:399210142..399213935: OID not found
Error: [blastdbcmd] CoGr|NC_008802.1:352763833..352767598: OID not found
Error: [blastdbcmd] CoGr|NC_008803.1:129563892..129567698: OID not found

ADD REPLY • link updated 5.6 years ago by GenoMax 141k • written 5.6 years ago by wocana ▴ 20

1

Entering edit mode

are those the exact same IDs as you have in your input protein file? Can you provide the output of head on your input protein file?

You will at least have to remove the CoGR| part as that is added by orthomcl if I remember correctly

grep 'cluster000:' <mcloutput file> | cut -d ' ' -f2- | sed 's/CoGr|//g' |sed 's/ /\n/g' > ids2get.lst

I also replaced on of the sed cmds with cut (a bit neater)

ADD REPLY • link 5.6 years ago by lieven.sterck 15k

score 1 · Answer 2 · 2018-09-22

Thx you all at the end first i modified and then used lieven.sterck command

grep 'cluster000:' <mcloutput file> | sed 's/cluster000://' | sed 's/ /\n/g' > ids2get.lst

and then the script runned correctly

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers,_);
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);