How to append two fasta files?
1
0
Entering edit mode
4.9 years ago
Kumar ▴ 120

I have two fasta files as shown below,

File:1

>Contig_1:90600-91187
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC
Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG
>Contig_24:26615-28387
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA

File:2

>Contig_1:90600-91187                                                                                                                                                                      
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT                                                                                                
>Contig_24:26615-28387                                                                                                                                                             
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

Both files are having the same fasta headers but vary in terms of sequences. Hence, I need to replace File:2 sequences in File:1 as shown below,

Expected outcome:

>Contig_1:90600-91187                                                                                                                                                                                 
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT                                                         
>Contig_98:35323-35886                                                                                                                                                                                                                
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCG              
>Contig_24:26615-28387                                                                                                                                                                                                                                                                          
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

I tried with cat command, but it is concatenating all the sequences instead of replacing the sequences as I mentioned above.

unix fasta python • 1.5k views
ADD COMMENT
0
Entering edit mode
4.9 years ago
Corentin ▴ 600

Hi,

You can use samtools faidx

samtools faidx file1 Contig_1:90600-91187  > result.fa

samtools faidx file2 Contig_98:35323-35886 >> result.fa
samtools faidx file2 Contig_24:26615-28387 >> result.fa

However, the name format "contig:start-stop" could cause some issues, faidx might interpret it as a command to extract just a portion of the sequences. You can rename the contigs as "name_start_stop" to avoid this.

ADD COMMENT
0
Entering edit mode

Thank you Corentin. However, I have large fasta file, in that case it would be tedious to implement samtools faidx.

ADD REPLY
0
Entering edit mode

There is an option to give a list of regions to faidx:

-r, --region-file FILE   File of regions.  Format is chr:from-to. One per line.

You could create two files, one with a list of regions from file1, the other from file2 and run two faidx commands

ADD REPLY
0
Entering edit mode

Thank you Corentin, Is it possible to extract the sequences based on the reverse order coordinates by samtools faidx. For example, Instead of proper order

Contig_98:35323-35886

If it like this,

Contig_98:35886-35323

Will it work?

ADD REPLY
1
Entering edit mode

I am not exactly sure what you want to do, if you want the reverse complement, faidx also have an option for that:

   -i, --reverse-complement
    Output the sequence as the reverse complement. `When this option is used, “/rc” will be appended to the sequence names. To turn this off or change the string appended, use the --mark-strand option`.

You can see the documentation here : http://www.htslib.org/doc/samtools.html

ADD REPLY
0
Entering edit mode

Dear Corentin, If the coordinates are in reverse order, will samtools work? Let me put in this way, I have the series of coordinates to be extracted from the given data,

Contig_98:35323-35886
Contig_1:90600-91187
Contig_24:26615-28387
Contig_91:690-450

The above mentioned coordinates are well formatted except the last one Contig_91:690-450 . Because, the last coordinate is in reverse order. In that case will samtool works?

ADD REPLY
1
Entering edit mode

Apparently faidx is giving this error if the coordinates are reversed:

[faidx] Zero length sequence:
ADD REPLY
0
Entering edit mode

Thank you Corentin for your time and help. I have implemented your suggestions and all works well.

ADD REPLY

Login before adding your answer.

Traffic: 2443 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6