Question

Renaming fasta file according to a name list (blast output)

0

7.6 years ago

san.san ▴ 190

I need to rename the sequences in a FASTA file using a BLAST output file in which that FASTA was the query. Not all sequences had a hit.

my.fasta:

>ASSI-1_0 
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1 
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

blast.out (one hit per query and just first two columns):

ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963
ASSI-8_7    scaf0423760_50193.589
ASSI-11_10  scaf0285971_60192.428
ASSI-12_11  scaf0409557_70062.641

What I need to get is this:

>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

This is how I went about it.

First, pull out the FASTA headers and strip the ">":

grep "^>" my.fasta | sed -e 's/>//g' > headers.fasta

Next, keep only one hit per query from the BLAST output and cut out the first two columns:

awk '!x[$1]++' blast.out | cut -f1,2 > headers.blast

Then I merge the two header files and reverse-sort them so that my awk command keeps the line with the hit when it removes duplicates, and I use tac again to restore the name order:

tac headers.blast headers.fasta | sort -r -V | awk '!x[$1]++' | tac > headers.clean

headers.clean:

ASSI-1_0
ASSI-2_1
ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963

Finally, replace the original headers with the new ones (this only works for FASTA files where each sequence is on a single line):

awk 'NR%2==0' my.fasta | paste -d'\n' headers.clean - > my_renamed.fasta

The third step is what's giving me trouble. It's a very clumsy approach. Does anybody have a better way?

fasta sequence command-line • 3.2k views

ADD COMMENT • link updated 7.6 years ago by shenwei356 8.4k • written 7.6 years ago by san.san ▴ 190

1

7.6 years ago

shenwei356 8.4k

I'd use SeqKit, which does this in a single command.

Just download the executable binaries for your operating system (Windows/Linux/Mac OS X) and run:

$ seqkit replace -p '^([^ ]+)' -r '$1 {kv}' -k blast.out my.fasta
[INFO] read key-value file: blast.out
[INFO] 6 pairs of key-value loaded
>ASSI-1_0  
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1  
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102 
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538 
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963 
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

ADD COMMENT • link 7.6 years ago by shenwei356 8.4k

0

Awesome! I needed to do the same and this worked perfectly!

ADD REPLY • link 6.0 years ago by MarGar • 0

0

7.6 years ago

WouterDeCoster 47k

I would write a Biopython script to get it done: create a dictionary from the BLAST output, then loop over your FASTA file.

ADD COMMENT • link 7.6 years ago by WouterDeCoster 47k
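
The dictionary-and-loop idea suggested above does not strictly need Biopython. As a minimal sketch of the same logic, swapped into awk to stay with the command-line tools used elsewhere in this thread: the BLAST table is read into an array keyed by query name, then each FASTA header is looked up in it. The function name `rename_fasta` is hypothetical, and the sketch assumes the query is in column 1 and the hit in column 2 of a non-empty, whitespace-separated BLAST table.

```shell
#!/bin/sh
# Sketch only: rename_fasta is a hypothetical helper name, not from the thread.
rename_fasta() {
    # $1 = BLAST table (query in column 1, hit in column 2), $2 = FASTA file
    awk '
        # First file: build the lookup table, keeping the first hit per query.
        NR == FNR { if (!($1 in hit)) hit[$1] = $2; next }
        # Second file: header lines start with ">".
        /^>/ {
            id = substr($1, 2)          # first word of the header, minus ">"
            if (id in hit) print ">" id " " hit[id]
            else           print ">" id
            next
        }
        { print }                       # sequence lines pass through unchanged
    ' "$1" "$2"
}

# Demo on a cut-down copy of the sample data from the question.
demo=$(mktemp -d)
printf '>ASSI-1_0 \nTTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA\n>ASSI-3_2 \nATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT\n' > "$demo/my.fasta"
printf 'ASSI-3_2\tscaf0270669_20068.102\nASSI-8_7\tscaf0423760_50193.589\n' > "$demo/blast.out"
rename_fasta "$demo/blast.out" "$demo/my.fasta"
```

Because headers are matched line by line, this version should also cope with sequences wrapped over several lines. Note that it keeps only the first word of each original header, so any existing description after the name is dropped.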

score 3 · Accepted Answer · 2016-09-14

3

7.6 years ago

Pierre Lindenbaum 161k

Use the join command:

$ join -t $'\t' -a 2 -1 1 -2 1 <(sort -t $'\t' -k1,1 input.blast) <(paste - - < input.fasta | cut -c2- | sort -t $'\t' -k1,1) | awk -F '\t' '{printf(">%s",$1);i=2;if(NF==3) {printf(" %s",$2);i++;} printf("\n%s\n",$i);}'

The -t options (join, sort) and the -F option (awk) need a tab character as the delimiter. In bash you can write it as $'\t' as above, or type a literal tab with Ctrl-V followed by Tab.

>ASSI-1_0
TTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA
>ASSI-2_1
GTGAGGGAGGAGGACGCCTCGAGCAGAGGTAGGTCTGGAG
>ASSI-3_2 scaf0270669_20068.102
ATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT
>ASSI-4_3 scaf0189112_70083.538
GAATATCATTAAAAATCTACATTTATTTATGAGTTAGTAC
>ASSI-5_4 scaf0083789_70072.963
AGGACACCAGAAACTTTCTCCAAAGCTGAATTTGTGTATT

ADD COMMENT • link 7.6 years ago by Pierre Lindenbaum 161k

0

Would this command still work when not all sequences in my.fasta have a BLAST hit?

ADD REPLY • link 7.6 years ago by san.san ▴ 190

1

Yes, that is what the -a option of join is for:

       -a FILENUM
              also  print unpairable lines from file FILENUM, where FILENUM is
              1 or 2, corresponding to FILE1 or FILE2

ADD REPLY • link 7.6 years ago by Pierre Lindenbaum 161k

0

I'm getting this when I run it:

>STE-1_0
TCTTAACTCATTGTGTGTGTATGCC
>STE-100000_218031
scaf0293423_20071.471
>STE-10000_10255
TCTGGAAAATAAAGCGGTCCACCCC
>STE-100001_218032
scaf0391581

I wonder if it's because some blast outputs are like so:

STE-121254_274055   scaf7351783

Rather than:

STE-121254_274056   scaf0143876_10040.799

ADD REPLY • link 7.6 years ago by san.san ▴ 190

0

I just realised why it didn't work: I used a BLAST output file that included other columns apart from the first two.

ADD REPLY • link 7.6 years ago by san.san ▴ 190
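
The awk step in the join answer tests for exactly three fields (name, hit, sequence), so any extra BLAST columns shift the sequence out of place. A minimal sketch of trimming the table to two columns before joining, written with temporary files instead of process substitution so it runs in plain sh. The function name `join_rename` and the placeholder columns `extra1`/`extra2` are hypothetical, and the sketch assumes a tab-separated BLAST table and one-line sequences.

```shell
#!/bin/sh
# Sketch only: join_rename is a hypothetical helper name, not from the thread.
join_rename() {
    blast=$1; fasta=$2
    tab=$(printf '\t')
    tmp=$(mktemp -d) || return 1

    # Keep the first hit per query, then only columns 1-2, sorted on the key.
    awk -F '\t' '!seen[$1]++' "$blast" | cut -f1,2 |
        LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/hits.tsv"

    # Strip trailing blanks, put header and sequence on one tab-separated
    # line, drop the leading ">" and sort on the same key.
    sed 's/[[:space:]]*$//' "$fasta" | paste - - | cut -c2- |
        LC_ALL=C sort -t "$tab" -k1,1 > "$tmp/seqs.tsv"

    # -a 2 keeps sequences without a hit: those lines have 2 fields, hits give 3.
    LC_ALL=C join -t "$tab" -a 2 "$tmp/hits.tsv" "$tmp/seqs.tsv" |
        awk -F '\t' 'NF == 3 { printf(">%s %s\n%s\n", $1, $2, $3); next }
                     { printf(">%s\n%s\n", $1, $2) }'
    rm -rf "$tmp"
}

# Demo: sample records from the question plus two placeholder BLAST columns.
demo=$(mktemp -d)
printf '>ASSI-1_0 \nTTCCTTTTTGGTTCTCGATATTATGAACAGTTTCTCATCA\n>ASSI-3_2 \nATGTTAGCAAGTATAGCCAACTATATGAAACCTATGTCTT\n' > "$demo/my.fasta"
printf 'ASSI-3_2\tscaf0270669_20068.102\textra1\textra2\n' > "$demo/blast.out"
join_rename "$demo/blast.out" "$demo/my.fasta"
```

Running sort and join under the same locale (here `LC_ALL=C`) matters, since join expects both inputs in the collation order it uses itself.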

0

I'm wondering, how do I order my sequences to be >ASSI-1_0 then >ASSI-2_1 rather than >ASSI-1_0 and then >STE-100000_218031?

ADD REPLY • link 7.6 years ago by san.san ▴ 190
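
On the ordering question: the join pipeline emits records in the plain sort order it needed for joining, so numbers inside the names are compared character by character. One possible way to restore a natural order afterwards is the version sort (`sort -V`) that the question already uses. This is a minimal sketch, assuming a sort that supports `-V` (GNU coreutils), one-line sequences, and no tabs inside headers. The function name `natural_sort_fasta` is hypothetical.

```shell
#!/bin/sh
# Sketch only: natural_sort_fasta is a hypothetical helper name.
# Needs a sort with -V (version sort), as in GNU coreutils.
natural_sort_fasta() {
    # Pair each header with its sequence, sort the pairs, then split them again.
    paste - - < "$1" | LC_ALL=C sort -V | tr '\t' '\n'
}

# Demo with two records shown earlier in the thread, in plain sort order.
demo=$(mktemp -d)
printf '>STE-10000_10255\nTCTGGAAAATAAAGCGGTCCACCCC\n>STE-1_0\nTCTTAACTCATTGTGTGTGTATGCC\n' > "$demo/renamed.fasta"
natural_sort_fasta "$demo/renamed.fasta"
```

Version sort compares runs of digits numerically, so STE-1_0 sorts before STE-10000_10255 regardless of how the plain sort arranged them.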