Forum:Poll: Does your filesystem support xattr?
9.0 years ago
John 13k

Hello all :)

I use extended attributes extensively on my data to keep track of which reference genome the data was mapped to, how it was mapped, the MD5 checksum of the file, the bin size, and so on. I find it one of those really useful features that doesn't get the attention it deserves.

If you are not familiar with Extended Attributes, they are simply key/value pairs which you can add to your files that, hopefully, move with the data - http://en.wikipedia.org/wiki/Extended_file_attributes

For example, on Mac OSX:

xattr -w mapping mm9 ./mybam.bam

would store the key 'mapping' with a value of 'mm9' on the file ./mybam.bam

It can be read back with

xattr -p mapping ./mybam.bam

BAM files usually have the reference in the header, but for BigWig/BED/etc. data this is very convenient. Another very practical application in my work has been to store the MD5 hash in the metadata, both because our filenames/paths are always changing (!!) and to detect accidental filtering/truncation of data after it is created. For example, after adding the following two lines to the .bashrc on OS X:

writehash() { for file in "$@"; do xattr -w filehash "$(md5 -q "$file")" "$file"; done; }
readhash() { for file in "$@"; do printf '%s : ' "$file"; xattr -p filehash "$file"; done; }

It's easy to attach the MD5 hash to the file(s) once and then recall it instantly, without re-hashing the whole multi-gigabyte file(s), so you and your databases don't have to rely on file paths.
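
On Linux the same idea might look roughly like the sketch below. It assumes setfattr/getfattr (from the attr package) and md5sum (from coreutils) are installed. The attribute name user.filehash and the checkhash helper are illustrative choices, not part of the two functions above; Linux expects user-defined attributes to live in the "user." namespace.

```shell
# Linux sketch: store each file's MD5 in the 'user.filehash' attribute.
writehash() {
    for file in "$@"; do
        setfattr -n user.filehash -v "$(md5sum < "$file" | cut -d ' ' -f 1)" "$file"
    done
}

# Print the stored hash without re-reading the file contents.
readhash() {
    for file in "$@"; do
        printf '%s : %s\n' "$file" \
            "$(getfattr --only-values -n user.filehash "$file" 2>/dev/null || true)"
    done
}

# Hypothetical helper: re-hash the file and compare with the stored value,
# which is one way to notice truncation or filtering after the fact.
checkhash() {
    for file in "$@"; do
        stored="$(getfattr --only-values -n user.filehash "$file" 2>/dev/null || true)"
        actual="$(md5sum < "$file" | cut -d ' ' -f 1)"
        if [ -n "$stored" ] && [ "$stored" = "$actual" ]; then
            echo "$file : OK"
        else
            echo "$file : MISMATCH or no stored hash"
        fi
    done
}
```

Unlike readhash, checkhash has to read the whole file again, so it is something to run occasionally rather than on every lookup.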

I'm sure others can think of some much more creative uses for metadata, and I'd very much like to hear them!

But before I am really comfortable releasing code that makes use of metadata, I'm curious to know how many filesystems in bioinformatics production use actually support it. The compute servers where I work do not, mainly because the file system is NFS, which has to have extended attributes manually enabled when the file system is formatted.

Thus, I would be very grateful if people could comment with a yes or no, so we can get an idea of how prevalent support is. Note that xattr is a Mac binary. Check the Wikipedia page above for your platform's equivalent; on Linux, something like this should typically work:

touch somefile
setfattr -n "user.demo" -v "test" somefile
getfattr -n "user.demo" somefile
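
For anyone who would rather not type the check by hand, it could be wrapped roughly as follows. This is only a sketch: the function name supports_xattr and the probe file name are made up for illustration, and it assumes either setfattr/getfattr (Linux) or xattr (Mac) is on the PATH. It prints "yes" or "no" for the file system holding the given directory.

```shell
# Probe whether a directory's file system accepts user extended attributes.
# Runs in a subshell, so its variables do not leak into the calling shell.
supports_xattr() (
    dir="${1:-.}"
    value=""
    # Create a throwaway file on the file system being tested.
    probe="$(mktemp "$dir/xattr-probe.XXXXXX" 2>/dev/null)" || { echo "no"; exit 0; }
    if command -v setfattr >/dev/null 2>&1 && command -v getfattr >/dev/null 2>&1; then
        # Linux: write the attribute, then read the raw value back.
        if setfattr -n user.demo -v test "$probe" 2>/dev/null; then
            value="$(getfattr --only-values -n user.demo "$probe" 2>/dev/null || true)"
        fi
    elif command -v xattr >/dev/null 2>&1; then
        # Mac: same round trip with the xattr binary.
        if xattr -w user.demo test "$probe" 2>/dev/null; then
            value="$(xattr -p user.demo "$probe" 2>/dev/null || true)"
        fi
    fi
    rm -f "$probe"
    if [ "$value" = "test" ]; then echo "yes"; else echo "no"; fi
)
```

Called as `supports_xattr .` for the current directory. Note that it also prints "no" when the directory is not writable or neither tool is installed, so a "no" is worth double-checking by hand.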

Thank you!!! :)

metadata xattr • 3.0k views

Really cool concept. Even if it only worked on a Mac, it would be useful to a lot of people. Large compute nodes run all kinds of filesystems (AFS, etc.).


Just a note that one can add metadata to (compressed) BED with starch --note "foo bar baz..." and retrieve with unstarch --note, which has the nice feature of being independent of file system. You can put a lot of data in here, like a structured (query-able) and human-readable JSON string.


Yes, the concept is nice. In my case, I sometimes use it (on Linux) to tag bank files (SRA, for instance) with the URL they came from. I once made a little app that computed statistics on the reads (things like min/max length) and tagged the reads file with them; one can then quickly get information about the bank from these tags without having to parse the bank again.

Actually, even if such tags are not "inside" the file itself, I like to compare them to MP3 tags :)
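
A rough sketch of that kind of tagging, not the actual app: it assumes an uncompressed FASTQ file with four lines per record and the Linux setfattr tool. The function names and the attribute names user.read_min_len and user.read_max_len are hypothetical.

```shell
# Compute the min and max read length of a 4-line-per-record FASTQ file.
# Prints "<min> <max>", or nothing if the file contains no reads.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (count == 0 || n < min) min = n
             if (count == 0 || n > max) max = n
             count++
         }
         END { if (count > 0) print min, max }' "$1"
}

# Store the statistics as extended attributes (names are illustrative).
tagstats() {
    for file in "$@"; do
        range="$(read_length_range "$file")"
        [ -n "$range" ] || continue
        setfattr -n user.read_min_len -v "${range% *}" "$file"
        setfattr -n user.read_max_len -v "${range#* }" "$file"
    done
}
```

After `tagstats reads.fastq`, the values could be read back with getfattr without parsing the file again.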


Wow, I like it! I thought MD5s took a long time to compute, but statistics like pileup frequencies, coverage, total signal, etc. take orders of magnitude longer, and are frequently re-used in normalization steps. A stats-appending tool for common bioinformatics file types would be very useful :)

(but only if people can actually use extended attributes)

Maybe I'm thinking about this wrong - maybe the 'if you build it, they will come' philosophy would be better suited here.
