Labelling sequences within fasta files according to sample name.
6.6 years ago
Mitra • 0

Hello everybody, I have multiple fasta files from multiple samples, and I am trying to add the sample name to each sequence header within each fasta file.

One of my files looks like this:

>M03691:51:000000000-BD94Y:1:1101:14841:1381 1:N:0:1
ACTGGGTGTAAAGGGCGTGTAGGCGGAGAAGCAAGTCAGAAGTGAAATCCATGGGCTTAACCCATGAACTGCTTTTGAAACTGTTTCCCTTGAGTATCGGAGAGGCAGGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCCTGCTGGACGACAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCCGGT
>M03691:51:000000000-BD94Y:1:1101:15960:1389 1:N:0:1
TACTGGGGTATCTAATCCTATTTGCTCCCCACGCTTTCGGGACTGAGCGTCAGTTATGCGCCAGATCGTCGCCTTCGCCACTGGTGTTCCTCCATATATCTACGCATTTCACCGCTACACATGGAATTCCACGATCCTCTCACACACTCTAGCTCTACGGTTTCCATGGCTTACCGAAGTTAAGCTTCGATCTTTCACCACAGACCCTTAGTGCCGCCTGCTCCCTCTTTACACCCAGT
>M03691:51:000000000-BD94Y:1:1101:15662:1415 1:N:0:1
ACTGGGTGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCCCGTA

For example, I want to add Sample1 in front of every sequence header in this file, keeping everything else as it is.

I found an old Biostars post that is very similar, C: Renaming Entries In A Fasta File, but it renames the header completely. I want to keep everything and only add the sample name after the >.

In addition, I want to do this for a batch of files, so I assume I need to run some kind of loop?

I am trying the code below, obviously without success. I run it from the folder that contains all the fasta files.

  for f in *.fasta ; do
     bname=`basename $f`
      pref=${bname%%.fasta}
       awk '/^>/{print ">bname" ++i next; next}{print}' < $f > ${pref}_new.fasta; 
    done

Can anybody please help me with this? Thanks,

Mitra


Are you doing this for the purpose of adding read groups to the samples?


I need to add sample names to all sequences because I later want to concatenate all the fasta files and use the result for OTU picking in QIIME. Thanks.


QIIME has accessory programs to do this sort of thing. Are you not following their workflow?


Please let me know if QIIME has a direct way to do this; I couldn't find any. The documentation only says that the files need to be labelled when working with demultiplexed files, but not how to do that, which is why I was trying the approach above.

6.6 years ago
shoujun.gu ▴ 380

I modified the script. You could try:

  1. save the code in a file named 'biostar.py' (or any other name)
  2. move all your fasta files into a new folder named 'newfolder' (or any other name)
  3. in shell (make sure python version is 3.5 or later), run: python3 biostar.py newfolder
  4. the output files are written to the same folder, with '_out' appended to the original file name. You can change this in the script.

import os
import sys

# Folder containing the fasta files, passed as the first command-line argument
folder = sys.argv[1]
os.chdir(folder)

# List the folder once, up front, so output files written below are not re-read
filelist = sorted(n for n in os.listdir('.')
                  if os.path.isfile(n) and not n.startswith('.'))

for name in filelist:
    output = name + '_out'
    fa = []
    with open(name, 'r') as file:
        for line in file:
            # Header lines start with '>': insert the file name right after it
            if line.startswith('>'):
                line = '>' + name + line[1:]
            fa.append(line)
    with open(output, 'w') as out:
        out.writelines(fa)

Hope the result is what you want.
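
One thing to be aware of: the script above inserts the full file name, extension included, so a file called Sample1.fasta gives headers that start with >Sample1.fastaM03691... If only the sample name is wanted, a variant along the following lines may be closer to the original request. It is a minimal sketch, assuming the sample name is the file name without its extension and that all inputs end in .fasta; the function names `label_fasta` and `label_folder`, the `_` separator and the `_labelled` output suffix are illustrative choices, not part of the answer above.

```python
import sys
from pathlib import Path


def label_fasta(src, dst, sample, sep="_"):
    """Copy src to dst, inserting sample + sep right after '>' on header lines."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                line = ">" + sample + sep + line[1:]
            fout.write(line)


def label_folder(folder, pattern="*.fasta", suffix="_labelled"):
    """Label every file matching pattern; return the list of files written."""
    written = []
    # sorted() materialises the listing, so new outputs are never picked up mid-loop.
    for src in sorted(Path(folder).glob(pattern)):
        if src.stem.endswith(suffix):
            continue  # skip outputs left over from an earlier run
        dst = src.with_name(src.stem + suffix + src.suffix)
        label_fasta(src, dst, sample=src.stem)
        written.append(dst)
    return written


if __name__ == "__main__" and len(sys.argv) > 1:
    for path in label_folder(sys.argv[1]):
        print("wrote", path)
```

Saved under a hypothetical name such as label_fasta.py, it would be run as: python3 label_fasta.py newfolder. Streaming line by line also avoids holding a whole fasta file in memory.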


Hi shoujun.gu, should I run this in a for loop for multiple input files? I have over 100 fasta files to process. Can you please suggest how? Thanks, Mitra


I modified the previous answer. Hope it works.


Thank you very, very much shoujun.gu, you saved my day :) A BIG thank you!

6.6 years ago

BBMap has a tool called rename.sh which I use for this purpose:

rename.sh in=file.fa prefix=sample1 out=renamed.fa addprefix

There's also a related tool, "muxbyname.sh", which is great for bulk operations (renaming sequences from many files based on their origin file and outputting them into a single file), but not quite applicable in this case.
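
For a batch of files, the command above can simply be run once per file. Below is a minimal Python sketch of such a driver, assuming rename.sh is on the PATH, that all inputs end in .fasta, and that each file name without its extension is the sample name. The helper name `build_rename_commands` and the `renamed` output folder are hypothetical, chosen for illustration; only the rename.sh arguments shown above are used.

```python
import subprocess
import sys
from pathlib import Path


def build_rename_commands(folder, out_dir="renamed", pattern="*.fasta"):
    """Return one rename.sh argument list per fasta file found in folder."""
    folder = Path(folder)
    commands = []
    for src in sorted(folder.glob(pattern)):
        dst = folder / out_dir / src.name
        commands.append([
            "rename.sh",
            f"in={src}",
            f"prefix={src.stem}",  # sample name = file name without extension
            f"out={dst}",
            "addprefix",
        ])
    return commands


if __name__ == "__main__" and len(sys.argv) > 1:
    # Create the output folder ourselves rather than relying on the tool to do it.
    (Path(sys.argv[1]) / "renamed").mkdir(exist_ok=True)
    for cmd in build_rename_commands(sys.argv[1]):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)  # raises on the first failing file
```

Building the argument lists separately from running them makes it easy to print and inspect the commands before anything is executed.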


Hi Brian, should I run this in a for loop for multiple input files? Can you please suggest how? Thanks, Mitra


Is there a correlation between existing file names and the sample prefix that could be leveraged to create a loop? Do your files still have the barcodes at the beginning of the reads?


Yes, there is a correlation: the file names are the same as the sample names that I want to insert after the > of each sequence in each multi-fasta file.

These files are already demultiplexed, with adapters and barcodes removed. I then stitched the reads using fastq join (within qiime 1.1.9) and converted them from fastq to fasta using fastx-toolkit. So now I have a multi-fasta file for each sample.
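
Since the file names match the sample names and the files are going to be concatenated afterwards, labelling and concatenation could also be done in a single pass. The following is a rough Python sketch of that idea. It assumes that headers of the form >SampleName_N followed by the original read ID are acceptable downstream, which should be checked against the QIIME documentation; the function name `combine_fasta` and the output name combined_seqs.fna are made up for illustration.

```python
import sys
from pathlib import Path


def combine_fasta(folder, out_path, pattern="*.fasta"):
    """Write all sequences to out_path with headers '>sample_N original_header'.

    Returns a dict mapping each sample name to its number of sequences.
    """
    out_path = Path(out_path)
    # List inputs before opening the output, and never read the output itself.
    inputs = [p for p in sorted(Path(folder).glob(pattern))
              if p.resolve() != out_path.resolve()]
    counts = {}
    with open(out_path, "w") as fout:
        for src in inputs:
            sample = src.stem  # file name without extension
            n = 0
            with open(src) as fin:
                for line in fin:
                    if line.startswith(">"):
                        n += 1
                        line = f">{sample}_{n} {line[1:]}"
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    fout.write(line)
            counts[sample] = n
    return counts


if __name__ == "__main__" and len(sys.argv) > 2:
    for sample, n in combine_fasta(sys.argv[1], sys.argv[2]).items():
        print(sample, n)
```

A possible invocation, with a hypothetical script name, is: python3 combine_fasta.py newfolder combined_seqs.fna. The per-sample counts it prints give a quick check that no file was skipped.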
