Fasta sequence replacement based on header name
3
3
Entering edit mode
8.8 years ago
seta ★ 1.9k

Hi everybody,

I have two fasta file with two kinds of header format, I want to replace sequences of interest in one file based on header name (I have two list of headers for two fasta files as txt format). Could you please advise me what should I do?

For example header format for file 1 and 2 are as >contig10002|m.12543 and >c26528_g1_i1|m.14066, respectively, and I want to replace the related sequence of >c26528_g1_i1|m.14066 in file 2 with related sequence of >contig10002|m.12543.

alignment Assembly sequencing RNA-Seq • 4.5k views
ADD COMMENT
0
Entering edit mode

What happens with sequences you do not wish to replace in file 2 ? Should they stay there or should they be removed ??

ADD REPLY
0
Entering edit mode

other sequences should be stayed there.

ADD REPLY
1
Entering edit mode

One more question: Are files big? How many entries do they contain and how large is an individual fasta record (average)? Also do you need this to run fast or you can spare some time?

ADD REPLY
0
Entering edit mode

about big. file 1 and 2 composed of about 47000 and 150,000 sequences, respectively. I like everyone would like to run fast

ADD REPLY
0
Entering edit mode

You might just write a program using biopython or bioperl to do that. That will likely prove to be the simplest method.

ADD REPLY
0
Entering edit mode

I guess, but it's not easy for me as I'm basically biologist. Could you please let me this process is right?

Firstly, I can remove sequences of interest in file 2 (my mean is those sequences that must be replaced with other sequences from file 1), then I merged file with desired sequences in file 1 with file 2 by cat command. Is is right?

ADD REPLY
1
Entering edit mode
8.8 years ago
mxs ▴ 530

Ok based on your specs stated in comment section it looks like a simple perl script should satisfy. Ergo:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long;

#set variables
my ($optHelp, $optFile1, $optFile2, $optList);

#get program options
GetOptions ('h' => \$optHelp, 'a=s' => \$optFile1, 'b=s' => \$optFile2, 'l=s' => \$optList);

# there are better ways to do this but I like this strategy :)
if($optHelp || !$optFile1 || !$optFile2  || !$optList ){
  print "Usege:\n\n";
  print "./program -a file1 -b file2 -l list<.tsv> \n\n";
  print "Note: sequences from file2 are replaced with those in file1 according to the list -l\n\n";
  exit(0);
}


#allocate memory (well, more of set variables)
my %hash1 =();
my %list = ();

#open files
open (ONE, "<", $optFile1) or die "$!";
open (TWO, "<", $optFile2) or die "$!";
open (LIST, "<", $optList) or die "$!";

#load list which should look like:
#file1_id <tab> file2_id

while(<LIST>){
  chomp;
  /^(.*?)\t(.*)/;
  $list{$1} = $2; #e
}
close LIST;

#read file one into memory
my $head;
while(<ONE>){
  chomp;
  if(/>(.*)/){$head = $1; next;} #e
  $hash1{$head} .= $_;
}
close ONE;

#replace fasta record from file to with the one in file one if id is set in list file
my $x = 0;
while(<TWO>){
  chomp;
  ## You can play with this if one line fasfa sequence is not what you can use
  if(/>(.*)/){if(defined $list{$1}){print $_. "\n" . $hash1{$list{$1}} . "\n"; $x = 1}else{$x = 0}}  #e
  print $_ . "\n" if $x == 0;  #e
}
close TWO;

You copy the code into a file (regilar decument) and run it by executing :

perl program_name -a first_file -b second_file -l list_of_fasta_ids > output.file

Let me know if there are any problems with the program, since it is untested.

cheers

UPDATE:

corrected bugs are marked with #e. I said I haven't tested it :):

so:

file1

>Simulated_Sequence1
ATGGACGGGATTAATCCTGAATACTCTAACAGAAAGAGCTCCAATTATCATCTATACGGCCGGGAGAGTATCGCATGGGC
ATTAATATCCTATTCACTTC
>Simulated_Sequence2
ATGTTACTACGGGGTGGTCCGTTTAGCTCATATCCATCAACGTGGGACGACCTTGATCGAAGCACCCTCGCACGGTTTTA
TGGTGCTCGGATATACCGCC
>Simulated_Sequence3
ATGTTTTGCGTACAGGGCGTCACCCCCCGCGATTTATTTGCTGGCGAATCAGACGGCCTGGACGGTAGTGCCCGCATATG
TGCTGGTAGCGGCCTTTATG
>Simulated_Sequence4
ATGGACTCATATTTCGCCGCGTTTTACTTTGTTATCTATTGTCTGCTTGACGGCCACGTTGTGTACGGGTACACATCAAA
GTACTGCCGGGTTACAATGT

file2

>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCT
GTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGCCTCAACAAGTCTTATGGAGGAATTTACGTGCTCCGCGGCCATTATACCCAGGATCACTGGGGATGTACTCACGAAT
AGTGGGTCGTGACAGGCAGG
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGC
CAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGATCCAATCGAAACCGGAAATGAGATCCATTCGGTGTCCGGGTGAATCTCGCGAGGGATGGTATACATCGCATATTGG
TACTAAGCTCTGGGTATCGG

list

hSimulated_Sequence2    Simulated_Sequence2
hSimulated_Sequence4    Simulated_Sequence3

output:

perl program.pl -a file1 -b file2 -l list
>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCT
GTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGTTACTACGGGGTGGTCCGTTTAGCTCATATCCATCAACGTGGGACGACCTTGATCGAAGCACCCTCGCACGGTTTTATGGTGCTCGGATATACCGCC
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGC
CAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGTTTTGCGTACAGGGCGTCACCCCCCGCGATTTATTTGCTGGCGAATCAGACGGCCTGGACGGTAGTGCCCGCATATGTGCTGGTAGCGGCCTTTATG

I leave proper sequence formatting to you as an exercise :)

mxs

ADD COMMENT
0
Entering edit mode

Many thanks friend. I try it

ADD REPLY
0
Entering edit mode

Hi mxs,

It works almost fast, but there are some issue about it,

1) it removed other sequences and just keep replaced ones.

2) The sequences in the output file are located as tandem, so when I try to find the number of sequence in the output file using grep -c ">" file.fa command, it returned 1, but it's false. I would be highly appreciated for your help to solve the problem.

ADD REPLY
0
Entering edit mode

Though this does not justify my act, but I said I haven't tested it. script is corrected. :)

cheers

ADD REPLY
0
Entering edit mode

Hi mxs,

sorry, one of issues back to me and related to tsv format of name list, it sounds that all sequences are located there when I run script with tsv format of list, but I cannot check the number of sequence because of tandem location of sequences in the output file (that's why the command grep -c ">" file returned me 1) . Would you please let me know if there is a way to solve this problem? Thanks so much

ADD REPLY
0
Entering edit mode

So sorry, I have not seen your corrected bugs scripts. it works well now. Your attempt to solve the problem and help me would be highly appreciated. All the best for you

ADD REPLY
0
Entering edit mode
8.8 years ago
seta ★ 1.9k

Hi all,

one of our friends, mxs, presented a script to replace fasta sequence last day. I tried it and sounds that it worked, but now when I randomly check the length of replaced sequences to make sure of replacement (because length of replaced sequences differ from primary sequences) , I found that no sequence has been replaced, in fact output file is the same with input file. Even, I tried the script on mxs's example, but it did not work. Could you anybody advise me what's wrong? Thanks in advance.

ADD COMMENT
1
Entering edit mode

What do you mean it did not work??

I leave proper sequence formatting to you as an exercise

was my final comment!!!

that means that replaced sequences are one liners since I did not have time nor ambition to do that plus this should be a fairly simple task anyway.

Next: "Output is the same as an input": that sounds like the header information was not matched and for that to work I should have a small example (including header information and the actual sequnce) Usually the mistake people make in this type of cases is to literally create a list file as: seq1(space)<tab>(space)seq2

there are no spaces in the list file. only header info then tab then header. there are couple of other situations where the error might have occurred but I would need to see a small example and the way you run the script. also, if you are not on a unix machine please let me know.

mxs

ADD REPLY
0
Entering edit mode

Please be patient, I don't like to bother you, at all. I'm so grateful for your effort to solve the problem. As I mentioned in my previous comment, I found that no sequence replaced in the output file when I checked some sequences length. That's why, I tried to do task with your example, so copy, pasted file1, file2 and list. I'm in linux, here is output file with command that I entered: (sequence_replace is the name of your script). I also note that your point about space in the list.

Seta@chpc:~/software/bbmap$ perl sequence_replacement.pl -a file1 -b file2 -l list2.tsv

>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCTGTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGCCTCAACAAGTCTTATGGAGGAATTTACGTGCTCCGCGGCCATTATACCCAGGATCACTGGGGATGTACTCACGAATAGTGGGTCGTGACAGGCAGG
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGCCAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGATCCAATCGAAACCGGAAATGAGATCCATTCGGTGTCCGGGTGAATCTCGCGAGGGATGGTATACATCGCATATTGGTACTAAGCTCTGGGTATCGG

As it's obvious the output file is the same with file2 instead of replacing sequence number 2 and 4.

Thanks

ADD REPLY
0
Entering edit mode

hm... if you just copy/past sequences, then two line sequence should be printed in case of sequnce 1 and 3. to me this looks like a formatting problem. how are your line ends defined? can you add use Data::Dumper; before the use warnings; and then at the end of the script add

print Dumper(\%hash1);
print Dumper(\%list);

and post the output.

ADD REPLY
0
Entering edit mode
8.8 years ago
seta ★ 1.9k

OK, I added what you want and here is the output:

Seta@chpc:~/software/bbmap$ perl sequence_replacement1.pl -a file1 -b file2 -l list2.tsv

>hSimulated_Sequence1
ATGTCTATTCTCACCGTTAGACAGGAATCCCGAGTCGAAGGAGGCTTTATGTCTATATGTCGGCATTATCCGTACGGCCT
GTTCTACGTAACATTTTCAT
>hSimulated_Sequence2
ATGCCTCAACAAGTCTTATGGAGGAATTTACGTGCTCCGCGGCCATTATACCCAGGATCACTGGGGATGTACTCACGAAT
AGTGGGTCGTGACAGGCAGG
>hSimulated_Sequence3
ATGGGTGCTGGCGAACAGATCGACCTGCTTGACTCTACCTCCTATCCCGTCCAGGACACTCTAGCTTTTTATAGCCGGGC
CAACCAGGGGAGAGAAATAA
>hSimulated_Sequence4
ATGATCCAATCGAAACCGGAAATGAGATCCATTCGGTGTCCGGGTGAATCTCGCGAGGGATGGTATACATCGCATATTGG
TACTAAGCTCTGGGTATCGG
$VAR1 = {
',CTGGTAGCGGCCTTTATGGGGCGTCACCCCCCGCGATTTATTTGCTGGCGAATCAGACGGCCTGGACGGTAGTGCCCGCATATG
',GTGCTCGGATATACCGCCTGGTCCGTTTAGCTCATATCCATCAACGTGGGACGACCTTGATCGAAGCACCCTCGCACGGTTTTA
',TAATATCCTATTCACTTCTCCTGAATACTCTAACAGAAAGAGCTCCAATTATCATCTATACGGCCGGGAGAGTATCGCATGGGC
GTACTGCCGGGTTACAATGT'GCCGCGTTTTACTTTGTTATCTATTGTCTGCTTGACGGCCACGTTGTGTACGGGTACACATCAAA
        };
$VAR1 = {
          'hSimulated_Sequence4' => 'Simulated_Sequence3',
          'hSimulated_Sequence2' => 'Simulated_Sequence2'
        };

What's the wrong?

ADD COMMENT
0
Entering edit mode

As you can see file1 has not properly been uploaded due to probably formatting issues. what do you get if you type

file file1
file file2

usually in such cases the problem is with the copy/past protocol, which is odd since the second file and the list file seam not to have been affected. run the above file function on your files and make sure that they are in fact properly formatted fasta files. In my experience in several cases people have been "adjusting" their emac and vim editors and messed up the end line formatting's which caused them to see as if everything was formatted properly but it turned out not to be. could their be a possibility that the same happened here?? what text editor are you using??

ADD REPLY
0
Entering edit mode

the output for file file1 is: file1: ASCII text, with CRLF line terminators and for file file2 is: file2: ASCII text, with CRLF line terminators

I'm sure about the fasta format of my sequences, please let me know what do you mean "properly formatted fasta file"?

I use nano as text editor. However, I would like to find why script did not act?, it's yet hard to me to find it as I'm biologist and recently come to bioinformatics.

ADD REPLY
1
Entering edit mode

well there is your problem. If you are on a unix machine then files should be properly formatted. Unix-style LF line terminators are expected. As I said many things can be the underlying cause of this. Most frequent ones being copy/past from dos to unix (cygwin users on windows as a possible target group) or "setting" editors (nano, vim, emac) and messing it up somehow, etc. How to fix it:

run

dos2unix file1
dos2unix file2
dos2unix list

or if you are not in a position to install dos2unix program you might just run:

sed -i 's/\r/\n/' file1
sed -i 's/\r/\n/' file2
sed -i 's/\r/\n/' list

sed should be installed by default (at least on debian/ubuntu flavoured os)

ADD REPLY
0
Entering edit mode

That's great, thanks mxs. I used "sed" and then run script, sequences were replaced in the your example file, fine. But, hard to say it has not been replaced on my fasta sequences, even after running above-mentioned commands (sed) on them. I'm getting crazy, why it is?!

ADD REPLY
0
Entering edit mode

ok. give mi an example consisting of 2 sequences from each file you use in the real case scenario and 2 entries from the actual list file you use.

ADD REPLY
0
Entering edit mode

Oh, it's really kind of you. I prepare an example of my file for you. If it's possible, could you please give me your email address to attached them for you instead of copy and pasting here?, you know length of my sequences is a bit large.

ADD REPLY
0
Entering edit mode

I checked script on example file, which ready for you, it surprisingly works. As you kindly mentioned formatting is important especially for list name that I copy and past. However, I found that after using sed -i 's/\r/\n/' list it created space among names, like this:

c50740_g1_i2|m.48937    contig10084|m.12639

c39071_g1_i1|m.24963    contig10085|m.12640

But, they are without space in the example like here:

c50740_g1_i2|m.48937    contig10084|m.12639
c39071_g1_i1|m.24963    contig10085|m.12640

It may cause the problem (big problem), if it's right, could you please let me know how should I use the sed command (for proper formatting) without space?

I don't know how to thank you

ADD REPLY
1
Entering edit mode

well the script in my initial answer should be ok with the list file as is, but for educational purp. to remove empty lines from files run:

sed -i '/^\s*$/d' list

In the future, just a suggestion, if you have a question like the one presented here. I believe a better site for this particular problem would be: http://www.perlmonks.org/

since there we expect these kind of questions on a regular basis and you probably will get more efficient solution (efficient with respect to memory and/or CPU time) to your problem than here. Since this is more of an application usage than application development forum. And for C/C++ questions c board is quite good and we provide help on/for any level. so ...might be worth to think about it. I understand that the underlying problem here was: where to start and know what solution would be the right one, but if you are a starter you cannot go wrong with language like Perl. As an example equivalent to sed call above in Perl is:

perl -ne 's/^\s*$//g; print $_' list > list_new

So basically what I am trying to say learn a language (any language - programming) and you'll be more efficient in any aspect of data analysis.

cheers

mxs

ADD REPLY
0
Entering edit mode

Many thanks for your nice advice. I'm trying always to learn language, but it's hard for me as beginner with no previous background and nice teacher like you. I wish there was online website to learn such task...Again thanks my friend.

ADD REPLY

Login before adding your answer.

Traffic: 1882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6