Re: [RFC] Plumbing-only support for storing object metadata

"Stephen R. van den Berg" <srb@xxxxxxx> · Sun, 10 Aug 2008 13:20:38 +0200

Shawn O. Pearce wrote:
>The proper encoding for both keys and values should permit any data
>to be stored.  Doesn't the extended attributes feature in Linux and
>FreeBSD both support any data to be attached to an inode in the fs?

I'd think so, yes, so any attempt to store the metadata should support it
as well.
That would also imply that any such metadata storage has to allow
arbitrary blobs to be stored under tag names.
And *that* would imply that any kludge which specifies a flat-file format
to encode name/value pairs doesn't scale.

>I think this is a _BAD_ idea.

>A bad idea that will only clutter up the core object model, and
>the core processing code of that object model.  Extended attributes
>aren't used that much on local filesystems, because they are hard
>to work with and suck performance wise.  Performance in Git is
>a _feature_.  It matters.  Our clean object model really helps to
>make that possible.

Quite right.

However, pondering the idea a bit more, I could envision something
similar to the following:

In the git tree the following layout would be used:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/README/_owner
projects/README/_acl
projects/README/_icon
projects/README/_mimetype
projects/something.mpeg
projects/something.mpeg/_icon
projects/something.mpeg/_mimetype
projects/asubdir/thirdplainfile.txt

That would imply that in the tree storage, the only extension would be
that any reference to a blob in a tree object could be accompanied by
a reference to a tree object of the same name.  I.e. something like this
in the tree object:

100644 blob f7b7414159b8a7159538fac543b2b19ef531968e  README
000000 tree df6ee415f04d6ccea5dab0de562c2f155583a2c4  README
100644 blob 0a54f8ec13df03cf6bdb5b973acec6d8141c01cc  something.mpeg
000000 tree a421448d765abb7bb979dc1d56621d0fc9b41229  something.mpeg

The extra tree reference for README would actually refer to something like:

100644 blob be3365fdaae0f4ed8c22c4cf38a4b1f88f9069c3  _owner
100644 blob 739e9e8f3d095931084b54cbf7f90d8f64eb0ac6  _acl
100644 blob bc1a868bb50644712966a50150d21199c401d6d5  _icon
100644 blob 6076bde5b3b6b8bed4ec4968d09abdbf015b3b75  _mimetype

Those blobs would contain the extra attributes.
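
To make the pairing concrete, here is a minimal Python sketch (not git
code) that parses a listing in the "mode type sha  name" form shown above
and finds, for each blob, the same-name tree that carries its attributes.
The names Entry, parse_tree_listing and attribute_trees are made up for
illustration, and treating mode 000000 as the marker for an attribute tree
is an assumption taken from this proposal, not something git does today.

```python
from typing import NamedTuple

ATTR_TREE_MODE = "000000"  # assumption: the mode this proposal reserves for attribute trees

SAMPLE_LISTING = """\
100644 blob f7b7414159b8a7159538fac543b2b19ef531968e  README
000000 tree df6ee415f04d6ccea5dab0de562c2f155583a2c4  README
100644 blob 0a54f8ec13df03cf6bdb5b973acec6d8141c01cc  something.mpeg
000000 tree a421448d765abb7bb979dc1d56621d0fc9b41229  something.mpeg
"""


class Entry(NamedTuple):
    """One line of a tree listing (hypothetical helper type)."""
    mode: str
    kind: str
    sha: str
    name: str


def parse_tree_listing(text):
    """Turn 'mode type sha  name' lines into Entry tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on whitespace at most three times so names may contain spaces.
        mode, kind, sha, name = line.split(None, 3)
        entries.append(Entry(mode, kind, sha, name))
    return entries


def attribute_trees(entries):
    """Map each blob name to the sha of its same-name attribute tree, if any."""
    blob_names = {e.name for e in entries if e.kind == "blob"}
    return {
        e.name: e.sha
        for e in entries
        if e.kind == "tree" and e.mode == ATTR_TREE_MODE and e.name in blob_names
    }


if __name__ == "__main__":
    for name, sha in attribute_trees(parse_tree_listing(SAMPLE_LISTING)).items():
        print(name, "->", sha)
```

Blobs without a same-name tree simply do not show up in the result, which
mirrors the point that files without attributes are untouched.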

And that would imply that during checkout you can do a rich checkout or a
flat checkout for any files under the projects directory.

A flat checkout results in the following files in the filesystem:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/README.attr/_owner
projects/README.attr/_acl
projects/README.attr/_icon
projects/README.attr/_mimetype
projects/something.mpeg
projects/something.mpeg.attr/_icon
projects/something.mpeg.attr/_mimetype
projects/asubdir/thirdplainfile.txt

A rich checkout results in the following files in the filesystem:

plainfile.txt
otherdir/otherplainfile.txt
projects/README
projects/something.mpeg
projects/asubdir/thirdplainfile.txt

The rich checkout also applies the extended attributes/metadata to the
filesystem (i.e. it would store all the metadata in the appropriate
places).
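
As a rough illustration of the two checkout modes, the following Python
sketch maps the in-tree layout onto working-tree paths.  It only computes
path names; it does not touch the filesystem or any real extended
attribute API.  The function names and the rule "a path is an attribute
if one of its proper prefixes is itself a listed file" are assumptions
made for this sketch, and ".attr" is just the default suffix used in the
example listing.

```python
IN_TREE_LAYOUT = [
    "plainfile.txt",
    "otherdir/otherplainfile.txt",
    "projects/README",
    "projects/README/_owner",
    "projects/README/_acl",
    "projects/README/_icon",
    "projects/README/_mimetype",
    "projects/something.mpeg",
    "projects/something.mpeg/_icon",
    "projects/something.mpeg/_mimetype",
    "projects/asubdir/thirdplainfile.txt",
]


def owner_of(path, all_paths):
    """Return (owning file, attribute path) if path is an attribute, else None.

    The shortest proper prefix that is itself a listed path is taken as the
    owning file, so nested attribute hierarchies stay under one owner.
    """
    parts = path.split("/")
    for i in range(1, len(parts)):
        prefix = "/".join(parts[:i])
        if prefix in all_paths:
            return prefix, "/".join(parts[i:])
    return None


def flat_checkout(paths, suffix=".attr"):
    """Working-tree paths for a flat checkout: attributes go to '<file><suffix>/'."""
    all_paths = set(paths)
    result = []
    for path in paths:
        owner = owner_of(path, all_paths)
        if owner is None:
            result.append(path)
        else:
            result.append(f"{owner[0]}{suffix}/{owner[1]}")
    return result


def rich_checkout(paths):
    """Plain files to write, plus the attribute names to apply to each file."""
    all_paths = set(paths)
    files, attrs = [], {}
    for path in paths:
        owner = owner_of(path, all_paths)
        if owner is None:
            files.append(path)
        else:
            attrs.setdefault(owner[0], []).append(owner[1])
    return files, attrs


if __name__ == "__main__":
    print("\n".join(flat_checkout(IN_TREE_LAYOUT)))
    files, attrs = rich_checkout(IN_TREE_LAYOUT)
    print(files)
    print(attrs)
```

How a rich checkout would actually attach each attribute (xattrs, ACLs,
resource forks, ...) is filesystem-specific and deliberately left out here.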

The nice things about this setup are:
a. There is *no* change whatsoever to existing repositories or
   the repository format.
b. It's less filling (i.e. there are no special bits or object types to be
   used).
c. Speed for files without attributes is not affected.
d. It's fully 8-bit-transparent.
e. It scales, even if you have large or many attributes.
f. It uses the natural tree storage abstraction already supported in
   git repositories to store the additional data.
g. It allows reuse of attribute information at many levels.
h. It even allows for a hierarchy of attributes attached to a single
   file (no current filesystem supports that (yet)).
i. The only change in the fast-path of core-git is that it would have to
   know how to skip tree objects referenced in a tree object if a
   same-name blob object is already there.  This can even be optimised
   by requiring the attribute-tree to have a very specific (e.g. 0)
   mode to ease detection.
j. Editing and merging the meta-information could be made an almost
   natural operation in the flat-checkout mode (the extension to be used
   to name the attribute subdir should be made configurable).
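
Point i can be sketched in a few lines.  This is a Python illustration
of the idea only, not how core git walks trees: entries are assumed to
be (mode, type, sha, name) tuples with the mode as an integer, and mode 0
is assumed to mark an attribute tree, as suggested above.  The name
content_entries is hypothetical.

```python
ATTR_TREE_MODE = 0  # assumption: attribute trees carry mode 0, per item i


def content_entries(entries):
    """Yield ordinary tree entries, skipping attribute trees.

    The skip is a single integer comparison on the mode, so trees that
    contain no attribute entries pay almost nothing for the check.
    """
    for mode, kind, sha, name in entries:
        if mode == ATTR_TREE_MODE:
            continue
        yield mode, kind, sha, name


if __name__ == "__main__":
    tree = [
        (0o100644, "blob", "f7b7414159b8a7159538fac543b2b19ef531968e", "README"),
        (0, "tree", "df6ee415f04d6ccea5dab0de562c2f155583a2c4", "README"),
        (0o100644, "blob", "0a54f8ec13df03cf6bdb5b973acec6d8141c01cc", "something.mpeg"),
        (0, "tree", "a421448d765abb7bb979dc1d56621d0fc9b41229", "something.mpeg"),
    ]
    for mode, kind, sha, name in content_entries(tree):
        print(f"{mode:06o} {kind} {sha}  {name}")
```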
-- 
Sincerely,
           Stephen R. van den Berg.

Real programmers don't produce results, they return exit codes.