Seeking for a Linux based script ?
2
1
Entering edit mode
5.2 years ago

Hello

I am planning to merge two files into one.

File A contains reads like this(Millions of reads)

>M02127_204_000000000-ARDDL_1_1114_18930_10163      ee=0.0266694    
TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

File B contains reads like this (Milllion of reads)

M02127_204_000000000-ARDDL_1_1114_18930_10163   Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);

Both are 16S rRNA output file from Mothur_Miseq analysis. But for making a phylogetic tree, I should have to joing both files.

Good news is both files have contained same headers, I would like to make a single file(fileC<-FileA+FileB) with the common headers.

My Expected results could look like either File C

< M02127_204_000000000- ARDDL_1_1114_18930_10163:Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

*

or Final File C file starts like this

M02127_204_000000000-ARDDL_1_1114_18930_10163:Clostridiales_unclassified: ACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG

what could be the possible script?


Original Question

15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100. Its my sequances header,

I have to convert into >M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100

what could be the possible script?


alignment Assembly • 1.5k views
ADD COMMENT
1
Entering edit mode
echo '15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100' | sed 's/^.*:/>/'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
ADD REPLY
0
Entering edit mode

Please edit the post, It's not clear

ADD REPLY
0
Entering edit mode

Hello I am planning to merge two files into one. File A contains reads like this(Millions of reads) >M02127_204_000000000-ARDDL_1_1114_18930_10163 ee=0.0266694 TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG File B contains reads like this (Milllion of reads) M02127_204_000000000-ARDDL_1_1114_18930_10163 Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100); Both are 16S rRNA output file from Mothur_Miseq analysis. But for making a phylogetic tree, I should have to joing both files. Good news is both files have contained same headers, I would like to make a single file(fileC<-FileA+FileB) with the common headers.

File C have to look like this M02127_204_000000000-ARDDL_1_1114_18930_10163>Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Clostridiales_unclassified(100);Clostridiales_unclassified(100);TACGGAGGATCCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGTTAAGTTAGAGGTGAAATTCCG *

what could be the possible script?

ADD REPLY
1
Entering edit mode

You probably shouldn't edit your old question to replace it with a different one. Just make a new post if you have another question.

ADD REPLY
0
Entering edit mode

You want to convert "15:" to ">"? Or in general, do you want to convert anything to the left of the colon, including the colon, to ">"? For any number of lines in a file? There are ways to do both. One could easily use sed, or awk, or perl, with a one-liner (so no real script required), but as Asaf says....you need to explain a little more.

ADD REPLY
0
Entering edit mode

guskalja , please do not edit your question's entire premise after you get responses. That is bad etiquette and such a question will be closed in the future. If you wish to build on a previous question, open a new post and add a link to the previous post in the new post.

ADD REPLY
1
Entering edit mode
5.2 years ago

This can be tricky given that we don't know how other headers look like, or how the rest of your file is structured.

My assumptions:

1) The header is always <Numbers>:<Rest of header>
2) The file is structures as alternating between header/sequence only.
3) There's no : other than the one in the header.

Suggestion: Use sed to write a new file (output using > newfile.fasta) or modify in place with the -i switch and a regular expression.

Example:

Input

$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" 
15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA

After piping to sed

$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" | sed 's|^[[:digit:]]\+:|>|g'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA

The regex in sed reads as Match any line start (^) of one to many digits ([[:digits:]] matches number, + is the one to many) followed by :. Replace with >.

See/execute man sed for more information.

ADD COMMENT
0
Entering edit mode

If you're operating under those assumptions anyway, why not just use [^:]+: as the regex instead of ^[[:digit:]]+:?

ADD REPLY
0
Entering edit mode

Not sure what you mean but that doesn't work; it doesn't match any digits, the caret has to be outside the character class, and the colon should be trailing the match, not part of the character class.

ADD REPLY
0
Entering edit mode

that doesn't work

Did you try with sed -r? Also, I did not write the entire regex, only the part that matches the characters preceding a colon. The whole regex would be: ^[^:]+:, meaning "match everything that's not a colon that's before a colon". By adding the ^ anchor, you can target character from index 0 to just before first colon.

ADD REPLY
0
Entering edit mode

Did you try with sed -r?

That seems like an important detail. Also, using extended regexs usually have some performance cost.

Also, I did not write the entire regex

You comment suggests replacing the entire regex I wrote with a partial one.

Finally, my version specifically grabs <Numbers>:, and not just <Everything>:. I stated my assumptions based on a somewhat conservative interpretation of what the header was (e.g. it starts numbers, not just anything) so your suggestion is effectively doing something different, with different assumptions. Post it as an answer if you feel strongly about it, maybe that's what the OP wanted.

ADD REPLY
0
Entering edit mode

You comment suggests replacing the entire regex I wrote with a partial one.

You're right, I should have added the ^ anchor to my initial regex, the replacement I recommended did not substitute your regex completely.

ADD REPLY
0
Entering edit mode

I'd assumed that [[:digit:]] would work on GNU sed without the -r, but that doesn't seem to be the case. What this means is that your regex may also need the -r flag, thereby negating the performance cost gain point.

ADD REPLY
0
Entering edit mode

Seems to work for me.

$ sed --version
sed (GNU sed) 4.2.2
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
and Paolo Bonzini.
GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.

u@r2:~$ echo -e "15:M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100\nAATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA\n" | sed 's|^[[:digit:]]\+:|>|g'
>M02127_204_000000000-ARDDL_1_2106_8332_24370 Bacteria(100
AATCGATCGATCGTAGTGACTGATCGATCGATCGTAGCTAGCTAGCTAGCTA
ADD REPLY
0
Entering edit mode

Interesting. Does not work on GNU sed 4.0. This is kinda why I always use the -r flag - these differences annoy me.

ADD REPLY
0
Entering edit mode
5.2 years ago
Asaf 10k

Now that the answer is clear, my best suggestion is to write a small python script that will use biopython, read File B to a dictionary with the name as key and taxonomy as value and then read the fasta file with SeqIO, add the taxonomy using the dictionary to the id and print as fasta, should be very simple and take 20 lines of codes or so. Should be a good exercise as well.

ADD COMMENT
0
Entering edit mode

Hi Asaf, I am not familiar with python. I don't even know, Where from I can begin with Python scripts, could suggest some good online tutorials?

ADD REPLY
0
Entering edit mode

biopython cookbook would be a good place to start: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc48

ADD REPLY

Login before adding your answer.

Traffic: 2385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6