Question: How to add the sample name to the end of read headers

I need to add the sample name to the end of every read header in each FASTA file. For example, I have:

#Sample1
#>read1
#ATGC
#Sample2
#>read1
#ATGC

Desired output:

#Sample1
#>read1/Sample1
#ATGC
#Sample2
#>read1/Sample2
#ATGC

I can do it one file at a time using sed:

sed 's/read1/read1\/Sample1/g' Sample1.fasta > Sample1_tagged.fasta

However, I have hundreds of FASTA files. Any tips on how to process them all at once would be highly appreciated.

17 months ago juan.galarza • 0 • updated 15 months ago Biostar 20

juan.galarza, are Sample1 and Sample2 file names? Without sufficient information this becomes an XY problem, and the solutions posted here will be of no use. If they are in different files:

$ awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' Sample*

or

$ sed -e ' />/ F' Sample* | paste  - - - | awk '{print $2"/"$1"\n"$3}'

>read1/Sample1
ATGC
>read1/Sample2
ATGC

Input files (Sample1 and Sample2):

$ tail -n+1 Sample*
==> Sample1 <==
>read1
ATGC

==> Sample2 <==
>read1
ATGC
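
The awk one-liner above relies on every record being exactly one header line followed by one sequence line. If the sequences are wrapped over several lines, a variant that only touches header lines may be safer. This is a minimal sketch, assuming the same Sample1/Sample2 naming; the scratch directory, the toy files and the output name combined.fasta are made up for illustration:

```shell
# Toy input in a scratch directory (illustrative only): same reads as above,
# but with the sequence wrapped over two lines.
cd "$(mktemp -d)"
printf '>read1\nAT\nGC\n' > Sample1
printf '>read1\nAT\nGC\n' > Sample2

# Append "/<file name>" to header lines; pass every other line through unchanged.
awk '/^>/ { print $0 "/" FILENAME; next } { print }' Sample* > combined.fasta
cat combined.fasta
```

Because only lines starting with `>` are modified, blank lines and wrapped sequence lines are printed as they are.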
17 months ago cpad0112 11k
sed '/^>/s/$/\/SAMPLE/' in.fa > out.fa
17 months ago Pierre Lindenbaum 120k

This would append the string "SAMPLE" to each FASTA header, which differs from the OP's intended output. The OP wants to append the sample name (Sample1, Sample2, Sample3, etc.) to each sequence header. From the post, it seems the OP has several files named Sample1, Sample2, etc., and each file contains FASTA sequences.
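
For illustration, the fixed SAMPLE string in the sed command could be replaced by each file's own name. This is a minimal sketch, assuming files named Sample1, Sample2, ... as guessed above, and names free of `|`, `&` and backslashes (those would need escaping in the sed replacement). The scratch directory, the toy files and the output name all_tagged.fasta are made up for the example:

```shell
# Toy input in a scratch directory (illustrative only).
cd "$(mktemp -d)"
printf '>read1\nATGC\n' > Sample1
printf '>read1\nATGC\n' > Sample2

for f in Sample*; do
    # On header lines, append "/<file name>" at end of line.
    # "|" is used as the s-command delimiter so the "/" needs no escaping.
    sed "/^>/ s|\$|/$f|" "$f"
done > all_tagged.fasta
cat all_tagged.fasta
```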

17 months ago cpad0112 11k

Thank you for your answers, cpad0112 and Pierre Lindenbaum. Indeed, I have several files named Sample1.fa, Sample2.fa, etc., each with sequences in FASTA format. I would like to append the file name to the sequence IDs within those files. For example, the sequence IDs from file Sample1.fa would be

>read1/Sample1
>read2/Sample1

and IDs from file Sample2.fa would be

>read1/Sample2
>read2/Sample2

The awk solution does this, but it writes everything to a single output. Ideally, the relabelled sequences would be printed to a file corresponding to each input, i.e. all sequences from Sample1.fa printed to Sample1_relabel.fa, those from Sample2.fa printed to Sample2_relabel.fa, and so on.

17 months ago juan.galarza • 0 • updated 17 months ago genomax 68k

juan.galarza :

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

17 months ago genomax 68k

Try this, juan.galarza:

$ for i in *.fa; do awk -v OFS="\n" '/^>/ {getline seq} {print $0"/"FILENAME,seq}' "$i" > "${i%%.*}_relabel.fa"; done

Note: as a precaution, back up your files and run the script on a few samples first.
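
One detail worth checking on a test file: awk's FILENAME includes the extension, so with an input named Sample1.fa the loop above produces headers like >read1/Sample1.fa. If >read1/Sample1 is wanted instead, one possible variant passes the stripped name to awk as a variable. This is a sketch under the assumption that all inputs end in .fa; the variable name `sample` and the toy files are illustrative only:

```shell
# Toy input in a scratch directory (illustrative only).
cd "$(mktemp -d)"
printf '>read1\nATGC\n>read2\nGGCC\n' > Sample1.fa
printf '>read1\nATGC\n' > Sample2.fa

for f in *.fa; do
    # Skip outputs from an earlier run so the loop is safe to repeat.
    case $f in *_relabel.fa) continue ;; esac
    sample=${f%.fa}    # Sample1.fa -> Sample1
    # Append "/<sample>" to header lines only; sequence lines are printed unchanged.
    awk -v sample="$sample" '/^>/ { $0 = $0 "/" sample } { print }' "$f" > "${sample}_relabel.fa"
done
head Sample*_relabel.fa
```

Since only header lines are rewritten, this form also copes with sequences wrapped over several lines.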

If you have GNU parallel on your machine, you can try:

$ parallel  "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa

You can also dry-run the command first:

$ parallel  --dry-run "awk -v OFS=\"\n\" '/^>/ {getline seq} {print \$0\"/\"FILENAME,seq}' {} > {.}_relabel.fa" ::: *.fa
17 months ago cpad0112 11k

Thank you! The for loop did the trick. I didn't try the parallel options since I don't have GNU parallel on my machine.
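
Without GNU parallel, one possible way to process several files at once is xargs with -P, which GNU and BSD xargs support although POSIX does not require it. This is a rough sketch, assuming file names that end in .fa and contain no `|`, `&` or backslash; the job count of 4 and the toy files are arbitrary choices for the example:

```shell
# Toy input in a scratch directory (illustrative only).
cd "$(mktemp -d)"
printf '>read1\nATGC\n' > Sample1.fa
printf '>read1\nATGC\n' > Sample2.fa

# Hand the file names to xargs NUL-separated; run one sh per file, up to 4 at a time.
# Inside the sh script, $1 is the file name supplied by xargs.
printf '%s\0' Sample*.fa | xargs -0 -n 1 -P 4 sh -c '
    sample=${1%.fa}
    sed "/^>/ s|\$|/$sample|" "$1" > "${sample}_relabel.fa"
' sh
head Sample*_relabel.fa
```

Note that a second run in the same directory would also pick up the *_relabel.fa outputs, so a narrower glob or a separate output directory may be preferable.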

17 months ago juan.galarza • 0

Can you elaborate on why you do not have it? Is your reason covered at https://oletange.wordpress.com/2018/03/28/excuses-for-not-installing-gnu-parallel/

17 months ago ole.tange ♦ 3.4k
