Tool: log / log.bio - keeping track of command-line workflows
John 13k • 8.6 years ago

Hello all :)

The ac.gt team (which, in all fairness, is just 2 people at the moment...) is starting to release some of the software we've made over the last two years, starting with the program "log". (Such SEO. Much Googlable.)

In a sentence, you prepend "log" to any command and it gets logged in a graph database.
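
For example (the wrapped commands here are just illustrative - any command works the same way):

# prepend "log" and run the command as normal; the invocation is recorded
log wc -l variants.vcf
log gzip -k reads.fastq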

Of course it can do a lot of other stuff too, such as sending you e-mail or phone-call notifications when a command has finished running, and creating a pseudoterminal that acts almost exactly like a regular terminal (except everything gets logged). Ultimately it was designed to build a community-owned database of Bioinformatics commands that will hopefully make ambiguous workflows in publications a thing of the past, but you'll have to check out the video below to see all that in action :)

This is version 0.1, so still very much in the alpha phase. We would really benefit from the experience of a *nix C programmer (although Python/JS/CSS programmers are also very welcome!), but more than that, we would massively benefit from alpha testers. We know it works well on our systems/servers/laptops, but knowing it works for others would really settle our nerves :)

Video of the tool in action: https://vimeo.com/138812995

Website where you can get the client/server/web code, and read a bit more about how to use the tool: http://ac.gt/log

Thank you for your time!

UPDATE:

Due primarily to advice/suggestions made on BioStars, and after fully discussing it over the weekend with the other AC.GT developers, we are going to totally re-work the log server to be private by default. One of these private servers will run under 'demo.log.bio', which users can play around with and learn what exactly is being logged. If they like what they see, they can run their own server at their institute (docker package and/or Amazon AMI) which will also be private by default. Users of a private server will be able to "publish" their logs to the existing public 'log.bio' database, on a per-command basis.
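
As a very rough sketch of how running a private server might eventually look (purely hypothetical - no image has been published yet, and the name and port below are placeholders):

# hypothetical: neither this image name nor its configuration exist yet
docker run -d --name log-server -p 7474:7474 acgt/log-server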

But before any of this will work, the Database Admin console has to be replaced with a more limited (but more useful) custom visualisation. This is going to take me some time, but I feel it will be very much worth it as doing common operations will not require knowledge of Neo4j/Cypher. It will also be easily embeddable into other websites/blogs.

In the meantime, you probably shouldn't use log. To follow the roll-out status, check https://trello.com/b/1efgplCX/log-bio

Daniel ★ 4.0k • 8.6 years ago

First things first, I had no idea from the post and skimming the site that I was uploading all my logged commands and outputs to the public database. I assumed that it was a local database that was accessed through the browser. I don't really feel comfortable with my name, outputs and file system laid out bare online.

Second thing: I just tested it with log grep -c '>' sequences.fasta and it stripped the single quotes, so the > became a redirect and overwrote the file - sequences.fasta is now empty. Not good!

Thirdly, I have no clue how the node system works and don't really have the incentive to learn. I anticipated this would be an easy way to log my commands and then have them pop up nicely, but seems overly complex. Perhaps some stock queries for my own username would help.

John 13k

Sorry Daniel, to delete everything you logged run:

// delete every node (and any attached relationships) for this account
match (n {account:"YOURNAME"})
optional match (n)-[r]-()
delete n,r

where YOURNAME is your log.bio account. You can of course run your own logging database - I will do a full walk-through on exactly how to do that this evening after work, though it will require you to run your own server.

Point two - and this is something I really should have explained in more detail in the video - when you log something directly from the command line and you use characters interpreted by the shell (such as quotes, but also pipes, $, etc.), the shell actually strips/replaces them before they ever get to log. In your command, bash hands log the parameters [log, grep, -c, >, sequences.fasta], and log cannot know whether that > was a quoted string or not. In a sense it must have been, because otherwise log would have only seen [log, grep, -c]; but the more log plays around with what bash gives it, trying to figure out what the user actually typed, the more likely it is to make a mistake or hit a weird edge case.

In interactive mode none of this is a problem, but when running log directly from the command line with special characters, you unfortunately have to quote the whole command. For example:

log "grep -c '>' sequences.fasta"

Now bash will strip the quotes, and log will see your '>'. I know it's a pain, which is actually why I made interactive mode in the first place, but without somehow integrating log right into bash I can't get around it.
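
If you want to see exactly what log is up against, here is a quick throwaway demo (showargs.sh is just an illustration, not part of log) that prints the argv a program actually receives once bash has finished its quote removal:

# a tiny script that prints each argument it receives, in brackets
cat > showargs.sh <<'EOF'
#!/bin/sh
printf '[%s] ' "$@"; echo
EOF
chmod +x showargs.sh

./showargs.sh grep -c '>' sequences.fasta
# prints: [grep] [-c] [>] [sequences.fasta] - the quotes are already gone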

Finally, regarding usability, please remember that the program is still totally in alpha - this isn't something I just published in a journal, this is something I just got working. It's going to evolve with time, and what that process looks like depends on criticisms/suggestions from people like yourself. I agree that end users should not need to know Cypher to interact with the graph, so I'm working on a custom visualisation, just like the table but for viewing the data as a graph.

I'm really sorry it didn't work out for you Daniel - I appreciate your time trying it out, and I'll edit the website to reflect everything you've said. I wish I could get it all working tonight so I can somehow make up for wasting your time, but realistically this is going to take me a week to replace the raw database console with a more usable interface :(

Reply:

Also, please think about the privacy issue brought up by Daniel - having logs published publicly by default is as undesirable as it gets. Most people do not assume this, and even though it may be described on some pages, that is not sufficient.

People explore tools before reading everything about them. Publishing logs publicly should be opt-in - and even in that case it should carry lots of warnings.

John 13k

Yes, agreed - this should definitely come with a warning when you create an account... I'll make it so that you only get an API key after you click through a disclaimer.

Daniel ★ 4.0k

Don't stress too much, this was supposed to be constructive criticism. Will check it out again once you've had more time to work on it.

Matt • 8.6 years ago

Neat tools, but I'll link to my offsite comment about it:

John 13k

Sorry Matt! I didn't make it very clear that the client and server code is also on GitHub: https://github.com/JohnLonginotto/log.bio

The website code is not, partially because you can always grab the latest with:

for url in $(curl -s http://log.bio/all); do curl -s --create-dirs -o "$url" "http://log.bio/$url"; done

and partially because I get two messages a day about how the webpage looks terrible and I should change XYZ :P You were actually the first person to say it looks nice! (offsite)

Once we have a theme that everyone likes we'll stick with that, and I'll move it on up to GitHub :)

About the code style, yeah it's not everyone's cup of tea, I'll own up to that - I'm sure as more people work on it I'll pick up some better habits :P

Devon

Let me know if you need help with the code style. One of these days we're going to have an internal MPI-IE hackathon/documentation-athon for deeptools and we could probably add log/log.bio.

John 13k

I am very much down to help out, Devon! log is nowhere near as developed as deeptools though... you guys work on that professionally - I just do log in the evenings whilst watching Game of Thrones -_-;
Writing documentation, testing, etc. is something I enjoy doing, so maybe I can help out there.

Istvan • 8.6 years ago

Conceptually interesting, but even after watching the video I did not fully understand what type of problem this tool solves.

Your video example is way too trivial: it is an echo into a file, then showing the same file via cat. I have never needed to visualize a workflow like that - it is too simple to bother with, and in fact having a workflow for such commands is scary. The last thing I want is to have to visualize that as a graph.

Is log's purpose to create a track record of what went into an analysis? You should definitely formulate the rationale and utility of the tool in terms of where it helps one's work, and not just in terms of what the tool does. Doing the latter punts to the reader: you figure out how this is helpful.

For example, someone shows me an analysis; if I had a log, perhaps I could identify all the files and information that were used. Is that what log does? I am not sure. It is not clear how log could track files that are used but not specified on the command line, or whose names are not quite the same as the information on the command line.

John 13k

First of all, thank you for taking the time to watch the video Istvan :)

Yes, I see what you're saying - by leaving it up to the reader's imagination what could be done with the tool, it loses any practical appeal... hm, you're right. I need to provide a better tutorial where a real-world analysis is logged, showing how that would look. I kept it deliberately trivial so I wouldn't need to edit the video for time reasons - but it just so happens that I recently read a post on Biostars where someone created a sub-minute analysis with quite a few steps in it, all in bash, which can be easily logged with log ;-) I can't think of a better demo!

The latter part of your post is about how log really knows which files are used/created/modified/deleted. The answer is, unfortunately, that it doesn't - it will only say a file was used if it is mentioned on the command line (as a parameter, etc.), and not if it is buried somewhere deep in a program's code. Created/modified/deleted is a bit easier, because all working/mentioned directories are compared before and after the command, and 99% of the time it can be assumed that a new file in the working directory is the result of the command just logged - but again, it's not foolproof.

The only way to do this in a foolproof manner is to write a shim in C which sits between the executed code and the OS, such that any call to open(2) is caught and the file is MD5'd. If the program then asks the OS to write() or read() on that file handle, it gets marked as created/modified or used respectively. Although I know theoretically how to do this (and it's nothing new - it is basically how strace/dtrace work), I am not the person for the job. I'm hoping someone else can come along and get that working - but in the meantime I still have to optimise the graph, the website, the documentation, etc., so there is still plenty to do :)
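
To make that a bit more concrete, here is a rough sketch of both ideas in plain shell (none of this is part of log - the command and filenames are just illustrative):

# 1) roughly what log's heuristic does today: checksum the working
#    directory before and after, then diff to spot created/modified files
md5sum * > /tmp/before.md5
grep -c '>' sequences.fasta > gc_count.txt   # the command being tracked
md5sum * > /tmp/after.md5
diff /tmp/before.md5 /tmp/after.md5          # gc_count.txt shows up as new

# 2) the shim idea, approximated with strace: have the OS report every
#    file the traced command actually opens
strace -f -e trace=open,openat,creat -o file_activity.log \
    grep -c '>' sequences.fasta
grep -v ENOENT file_activity.log             # drop failed open attempts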
