Storing Large Flat Namespaces

"Jaap Suter" <git@xxxxxxxxxxxxx> · Sun, 7 Sep 2008 18:58:30 -0700

Hello,

I'm investigating the possibility of using Git to store a large flat
namespace. As an example, imagine a single directory containing
thousands or millions of files, each named using a 16-byte guid,
evenly distributed.

I'm aware that the Git object model makes various trade-offs,
typically in favor of managing source-tree layouts - which it does
extremely well. However, perhaps it is possible to carry some of Git's
features as a content revision tracker over to other storage
applications?

Currently, tree objects for large flat directories are quite large.
Doing a git-init, git-add, git-push on a flat directory with 10,000
files creates a tree object that is 24 kilobytes compressed. Any
change to a single file would create a whole new tree object, 24
kilobytes every time.

The obvious thought is to "virtualize" or "tree-ify" the namespace,
similar to how Git stores its own loose objects. (Is there a proper
name for this technique?)

This archive thread touches on that idea:

   http://kerneltrap.org/mailarchive/git/2006/2/8/200585

In particular (quoting Linus):

   "Final note: this means, for example, that git is relatively bad at
tracking a "hashed" nested file directory (like the one git itself
uses), because new files will end up randomly appearing in every
directory, and no directory is ever "stable".

I wonder how bad it is, so I did some calculations. What if Git could
"virtualize" the namespace when going from the working directory to
the object store, without the user ever realizing it. In other words,
Git would insert "fake" tree nodes to split apart large directories.
This leaves the question of how useful a working directory with a
million loose files is, but pretend for a moment that I have a Git
frontend that doesn't do full check-outs to a file system but allows a
user to grab single objects from the object store straight into
memory.

Ok, this is all back-of-the-napkin:

   H = sizeof(hash) = 20 bytes.
   G = sizeof(name) =  16 bytes.

   N = number of assets = let's say one million (hypothetical)

That means the theoretical minimum size of a tree object (without
virtualization) is:

  (H + G) * N = 36 megabytes

If we virtualize, using these parameters:

   B = number of bits we chop from the head of the filename, to create
the tree prefix (i.e., internal nodes)
   D = depth of virtualization (number of internal nodes we need to
traverse before we hit a bucket)

   (...in git's loose object store, B equals 8 and D equals 1.)

We get the following formulas:

                      2^(B*D)  = number of buckets (leaf tree objects)
                N / (2^(B*D)) = number of files per bucket
   (H+G) *  N / (2^(B*D)) = size of a bucket

     2^B * H = size of an internal node

The storage overhead of a single file commit then becomes the size of
a new bucket plus the size of the internal nodes back to the root.
I.e.,

    (D * (2^B) * H)     +     (H+G) * N / (2^(B*D))

The other interesting parameter is the total number of objects
involved in just virtualization overhead. For simplicity's sake, the
number of leaf objects is a good metric.

For different values of B and D, this leads to the following overhead
on a million-file directory.

    B = 0, D = 0 (how git treats large flat directories today)
        number of buckets: 1
        single change commit overhead: 36 megabyte

    B = 8, D = 1:
        number of buckets: 256
        single change commit overhead: 145736 bytes

    B = 8, D = 1:
        number of buckets: 65536
        single change commit overhead: 10780 bytes

    B = 3, D = 4:
        number of buckets: 4096
        single change commit overhead: 9424 bytes

Ultimately, the trade-off is between how may objects are involved in
the tree-ification, how many lookups need to be done to get an object,
and the storage overhead of commits.

The more I think about it, the less certain I am this kind of
data-structure is suitable for my application. I guess Linus's quote
at the top summed it up.

I'm wondering if anybody else has put any thought into this kind of
thing, and has suggestions for alternate data structures that could
enable Git to store large flat namespaces.

Thoughts are welcome. Or am I out to lunch?

Thanks,

Jaap Suter - http://jaapsuter.com
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html