Biostar Beta. Not for public use.
Question: Seeking for a Linux based script ?
1
Entering edit mode

Hello

I am planning to merge two files into one.

File A contains reads like this(Millions of reads)

>M02127_204_000000000-ARDDL_1_1114_18930_10163      ee=0.0266694    
TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

File B contains reads like this (Milllion of reads)

M02127_204_000000000-ARDDL_1_1114_18930_10163   Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);

Both are 16S rRNA output file from Mothur_Miseq analysis. But for making a phylogetic tree, I should have to joing both files.

Good news is both files have contained same headers, I would like to make a single file(fileC<-FileA+FileB) with the common headers.

My Expected results could look like either File C

< M02127_204_000000000- ARDDL_1_1114_18930_10163:Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

*

or Final File C file starts like this

M02127_204_000000000-ARDDL_1_1114_18930_10163:Clostridiales_unclassified: ACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

what could be the possible script?


Original Question

15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100. Its my sequances header,

I have to convert into >M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100

what could be the possible script?


Entering edit mode
1
echo '15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100' | sed 's/^.*:/>/'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
ADD REPLYlink 14 months ago
cpad0112
11k
Entering edit mode
0

Please edit the post, It's not clear

ADD REPLYlink 14 months ago
Asaf
5.6k
Entering edit mode
0

Hello I am planning to merge two files into one. File A contains reads like this(Millions of reads) >M02127_204_000000000-ARDDL_1_1114_18930_10163 ee=0.0266694 TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG File B contains reads like this (Milllion of reads) M02127_204_000000000-ARDDL_1_1114_18930_10163 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100); Both are 16S rRNA output file from Mothur_Miseq analysis. But for making a phylogetic tree, I should have to joing both files. Good news is both files have contained same headers, I would like to make a single file(fileC<-FileA+FileB) with the common headers.

File C have to look like this M02127_204_000000000-ARDDL_1_1114_18930_10163>Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG *

what could be the possible script?

ADD REPLYlink 14 months ago
B- for bioinformatics
• 20
Entering edit mode
1

You probably shouldn't edit your old question to replace it with a different one. Just make a new post if you have another question.

ADD REPLYlink 14 months ago
manuel.belmadani
• 830
Entering edit mode
0

You want to convert "15:" to ">"? Or in general, do you want to convert anything to the left of the colon, including the colon, to ">"? For any number of lines in a file? There are ways to do both. One could easily use sed, or awk, or perl, with a one-liner (so no real script required), but as Asaf says....you need to explain a little more.

ADD REPLYlink 14 months ago
seidel
6.8k
Entering edit mode
0

guskalja , please do not edit your question's entire premise after you get responses. That is bad etiquette and such a question will be closed in the future. If you wish to build on a previous question, open a new post and add a link to the previous post in the new post.

ADD REPLYlink 14 months ago
RamRS
21k
1
Entering edit mode

This can be tricky given that we don't know how other headers look like, or how the rest of your file is structured.

My assumptions:

1) The header is always <Numbers>:<Rest of header>
2) The file is structures as alternating between header/sequence only.
3) There's no : other than the one in the header.

Suggestion: Use sed to write a new file (output using > newfile.fasta) or modify in place with the -i switch and a regular expression.

Example:

Input

$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" 
15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA

After piping to sed

$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" | sed 's|^[[:digit:]]\+:|>|g'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA

The regex in sed reads as Match any line start (^) of one to many digits ([[:digits:]] matches number, + is the one to many) followed by :. Replace with >.

See/execute man sed for more information.

ADD COMMENTlink 14 months ago manuel.belmadani • 830
Entering edit mode
0

If you're operating under those assumptions anyway, why not just use [^:]+: as the regex instead of ^[[:digit:]]+:?

ADD REPLYlink 14 months ago
RamRS
21k
Entering edit mode
0

Not sure what you mean but that doesn't work; it doesn't match any digits, the caret has to be outside the character class, and the colon should be trailing the match, not part of the character class.

ADD REPLYlink 14 months ago
manuel.belmadani
• 830
Entering edit mode
0

that doesn't work

Did you try with sed -r? Also, I did not write the entire regex, only the part that matches the characters preceding a colon. The whole regex would be: ^[^:]+:, meaning "match everything that's not a colon that's before a colon". By adding the ^ anchor, you can target character from index 0 to just before first colon.

ADD REPLYlink 14 months ago
RamRS
21k
Entering edit mode
0

Did you try with sed -r?

That seems like an important detail. Also, using extended regexs usually have some performance cost.

Also, I did not write the entire regex

You comment suggests replacing the entire regex I wrote with a partial one.

Finally, my version specifically grabs <Numbers>:, and not just <Everything>:. I stated my assumptions based on a somewhat conservative interpretation of what the header was (e.g. it starts numbers, not just anything) so your suggestion is effectively doing something different, with different assumptions. Post it as an answer if you feel strongly about it, maybe that's what the OP wanted.

ADD REPLYlink 14 months ago
manuel.belmadani
• 830
Entering edit mode
0

You comment suggests replacing the entire regex I wrote with a partial one.

You're right, I should have added the ^ anchor to my initial regex, the replacement I recommended did not substitute your regex completely.

ADD REPLYlink 14 months ago
RamRS
21k
Entering edit mode
0

I'd assumed that [[:digit:]] would work on GNU sed without the -r, but that doesn't seem to be the case. What this means is that your regex may also need the -r flag, thereby negating the performance cost gain point.

ADD REPLYlink 13 months ago
RamRS
21k
Entering edit mode
0

Seems to work for me.

$ sed --version
sed (GNU sed) 4.2.2
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
and Paolo Bonzini.
GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.

u@r2:~$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" | sed 's|^[[:digit:]]\+:|>|g'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA
ADD REPLYlink 13 months ago
manuel.belmadani
• 830
Entering edit mode
0

Interesting. Does not work on GNU sed 4.0. This is kinda why I always use the -r flag - these differences annoy me.

ADD REPLYlink 13 months ago
RamRS
21k
0
Entering edit mode

Now that the answer is clear, my best suggestion is to write a small python script that will use biopython, read File B to a dictionary with the name as key and taxonomy as value and then read the fasta file with SeqIO, add the taxonomy using the dictionary to the id and print as fasta, should be very simple and take 20 lines of codes or so. Should be a good exercise as well.

ADD COMMENTlink 14 months ago Asaf 5.6k
Entering edit mode
0

Hi Asaf, I am not familiar with python. I don't even know, Where from I can begin with Python scripts, could suggest some good online tutorials?

ADD REPLYlink 14 months ago
B- for bioinformatics
• 20
Entering edit mode
0

biopython cookbook would be a good place to start: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc48

ADD REPLYlink 14 months ago
Asaf
5.6k

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0