Rename fasta headers from CSV
12 months ago
SaltedPork • 100

I have a CSV file that looks like this:

   201200175,A/name1/175/2012
   201200287,A/name2/287/2012
   201200845,A/name3/845/2012

Currently my fasta headers look like:

>201200175_AA
>201200175_AB
>201200175_BB

and I want to change it to:

>A/name1/175/2012_AA
>A/name1/175/2012_AB
>A/name1/175/2012_BB

I want to preserve the suffix (_AA etc.). I have multiple FASTA files, and they are all multi-FASTA.

I was wondering if there is a quicker way in bash rather than writing out some Perl...


Do the FASTA files have line breaks within the sequences? If not, then perhaps something like:

join -1 1 -2 1 -t $'\t' \
    <(sed 's/,/\t/' file.csv | sort -k1,1) \
    <(paste - - <file.fa | sed -e 's/_/\t/' -e 's/>//' | sort -k1,1) \
    | awk 'BEGIN{FS="\t"}{print ">"$2"_"$3"\n"$4}'
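
If the sequences do wrap over several lines, or the original record order matters (join needs sorted input, so the output order changes), a single awk pass may be simpler. The following is a minimal sketch rather than a drop-in solution: the function name rename_headers and the sample file contents are made up for illustration, and it assumes the ID is everything before the first underscore in the header. It also trims stray whitespace around the CSV fields.

```shell
# Hypothetical sample inputs that mirror the question (file names are assumptions).
workdir=$(mktemp -d)
printf '%s\n' '   201200175,A/name1/175/2012' '   201200287,A/name2/287/2012' > "$workdir/changes.csv"
printf '%s\n' '>201200175_AA' 'AGCT' '>201200287_AB' 'GGCC' '>999_ZZ' 'TTTT' > "$workdir/sample.fasta"

# rename_headers CSV FASTA: print FASTA with IDs swapped for their CSV names.
rename_headers() {
    awk -F',' '
        NR == FNR {                              # first file: build the lookup table
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t\r]+$/, "", key)   # trim padding and stray CRs
            gsub(/^[ \t]+|[ \t\r]+$/, "", val)
            map[key] = val
            next
        }
        /^>/ {                                   # second file: header lines
            header = substr($0, 2)               # drop the leading ">"
            pos = index(header, "_")             # first underscore splits ID and suffix
            if (pos > 0) {
                id = substr(header, 1, pos - 1)
                if (id in map) {
                    print ">" map[id] substr(header, pos)
                    next
                }
            }
        }
        { print }                                # sequences and unmatched headers pass through
    ' "$1" "$2"
}

renamed=$(rename_headers "$workdir/changes.csv" "$workdir/sample.fasta")
printf '%s\n' "$renamed"
rm -rf "$workdir"
```

Headers whose ID is not in the CSV are left unchanged, which makes missing keys easy to spot. For several FASTA files, the function could be called in a loop, writing each result to a new file.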

Thanks, no they don't.

12 months ago
SMK ♦ 1.3k
Ghent, Belgium

Hi SaltedPork,

Try seqkit:

$ cat changes.csv
201200175,A/name1/175/2012
201200287,A/name2/287/2012
201200845,A/name3/845/2012

$ cat sample.fasta
>201200175_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>201200175_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>201200175_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT

$ seqkit replace --kv-file <(sed "s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement "{kv}_\${2}" sample.fasta
[INFO] read key-value file: /dev/fd/63
[INFO] 3 pairs of key-value loaded
>A/name1/175/2012_AA
AGCTAGCTAGCTGCATGCTGCATGCTACG
>A/name1/175/2012_AB
AGCTGCATGCTAGCTGATCGTAGCTAGCT
>A/name1/175/2012_BB
GCTAGCTAGCTGCATGCTAGCTAGCTGCT

Thanks for this answer, very clear.

I am getting fasta headers like >_AA, >_BB. The bit from the CSV is not there.

Could you explain what the sed and regex bits are doing? Sed is replacing commas with tabs in the CSV; is this because seqkit doesn't handle the commas? Also, what is the _ doing in the pattern bit?


Hi SaltedPork,

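"Sed is replacing commas with tabs in the csv, is this because seqkit doesn't handle the commas?"

Yes. As seqkit replace -h shows, the key-value file has to be tab-delimited:

    -k, --kv-file string    tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)

"^(\d+)_(\w+)" is the regex that matches 201200175_AA, 201200175_AB, and 201200175_BB, i.e. the part that you want to replace. The first group, (\d+), captures the numeric ID, which is looked up in the key-value file and substituted for {kv}. _ is the literal separator between 201200175 and AA, and the second group, (\w+), captures the suffix, which ${2} puts back in the replacement.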
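
To see what the two capture groups pick out, the same pattern can be tried on a bare header string with sed. This is only an illustration under a few assumptions: show_groups is a hypothetical helper, not part of seqkit, and because sed -E has no \d, the classes [0-9] and [A-Za-z0-9_] stand in for \d and \w.

```shell
# Illustration only: show_groups is a hypothetical helper, not part of seqkit.
# sed -E has no \d, so [0-9] and [A-Za-z0-9_] stand in for \d and \w.
show_groups() {
    printf '%s\n' "$1" | sed -E 's/^([0-9]+)_([A-Za-z0-9_]+)/key=\1 suffix=\2/'
}

matched=$(show_groups '201200175_AA')    # header that fits the pattern
unmatched=$(show_groups 'sampleX_AA')    # no leading digits, so nothing is replaced
printf '%s\n' "$matched" "$unmatched"
# prints:
# key=201200175 suffix=AA
# sampleX_AA
```

A header that does not start with digits followed by an underscore is left untouched, so a quick check like this can help when adapting the pattern to other naming schemes.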


Thanks, but my output headers still don't contain the value from the second CSV column, just the _AA. Any ideas? I've tried playing with the regex but no luck.


Ah, sorry, I didn't realize that you have leading spaces in your CSV. Try this:

$ seqkit replace --kv-file <(sed -r "s/^\s+//g; s/,/\t/g" changes.csv) --pattern "^(\d+)_(\w+)" --replacement '{kv}_${2}' sample.fasta
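
The -r flag and the \s and \t escapes are GNU sed features and may not behave the same with other sed implementations (for example the one shipped with macOS). A more portable way to check for leading whitespace and build the tab-delimited key-value file could look like the sketch below; the temporary file and its two sample lines are assumptions for illustration.

```shell
# Hypothetical sample CSV: the first line has leading spaces, the second does not.
csv=$(mktemp)
printf '%s\n' '   201200175,A/name1/175/2012' '201200287,A/name2/287/2012' > "$csv"

# Count the lines that start with whitespace (their keys would not match the IDs).
padded=$(grep -c '^[[:space:]]' "$csv")
echo "lines with leading whitespace: $padded"

# Strip leading whitespace with a POSIX character class, then turn commas into tabs.
kv=$(sed 's/^[[:space:]]*//' "$csv" | tr ',' '\t')
printf '%s\n' "$kv"
rm -f "$csv"
```

The cleaned output can then be fed to --kv-file in place of the sed call in the command above, for example through process substitution.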
