Re: Mercurial on BigTable

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Thu, 11 Jun 2009 21:14:28 -0700

Scott Chacon <schacon@xxxxxxxxx> wrote:
> Has anyone watched this yet?
> 
> http://code.google.com/events/io/sessions/MercurialBigTable.html

I hadn't seen that yet, thanks.

> It's kind of interesting - a Googler talks about getting Mercurial
> running on BigTable.  What fascinates me is that if I'm not horribly
> mistaken, it seems like they just threw out the revlog format entirely
> and just store the data in a key-value store as sort of a Git-like
> content addressable filesystem.

Almost... but not quite.  If you look at the way they store files
they embed the file path as part of the BigTable key.  This makes it
cheap to return all revisions between X and Y for any given file, as
it's just a range scan over the keys.  Git doesn't do this normally.
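
A minimal sketch of why that range scan is cheap, assuming a sorted
key-value store and a hypothetical key layout of (path, revision).
The actual BigTable schema isn't spelled out in this thread, so the
names SortedStore and file_revisions are made up for illustration:

```python
import bisect


class SortedStore:
    """Toy stand-in for a store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return all (key, value) pairs with start <= key <= end."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def file_revisions(store, path, rev_x, rev_y):
    # Hypothetical key layout: (path, revision).  Because the path
    # leads the key, every revision of one file is contiguous.
    return store.scan((path, rev_x), (path, rev_y))
```

With keys addressed only by content hash, as in Git, revisions of
the same file land at unrelated positions in the key space, so there
is no equivalent single scan.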

In Hg, and in their implementation of it on BigTable, if a file
content is copied between two paths (same blob in git terms) they
actually duplicate the data, once under each path.  We could do
something like that in Git... and just pay the price on copy, and
then you can get a storage layout like they do, and have it scale
well onto a larger system.  But... the pack the client receives
will suffer in size; it will be bigger.
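
To make the copy cost concrete, here is a small sketch comparing the
two layouts.  The blob id follows Git's usual "blob <size>\0<data>"
SHA-1 scheme; the path-keyed layout and the helper names are only
assumptions for illustration, not the schema from the talk:

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """SHA-1 over the Git blob header plus the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def content_addressed(files):
    """Git-like layout: identical content is stored once."""
    store = {}
    for data in files.values():
        store[git_blob_id(data)] = data
    return store


def path_keyed(files, rev=0):
    """Hypothetical path-in-key layout: one copy per path."""
    return {(path, rev): data for path, data in files.items()}


def stored_bytes(store):
    return sum(len(value) for value in store.values())
```

For a file copied to a second path, content_addressed() holds one
entry while path_keyed() holds two, so the stored bytes double for
that file.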

> Does anyone know how they do the graph walking efficiently with this
> structure?  He mentioned it was about half as fast as native Hg, but
> that seemed to be acceptable.  Curious if anyone had any thoughts or
> information on this.  Shawn, are there technical reasons why this
> works well the way they're doing it for Hg but would not for Git (like
> in the repo MINA based server)?  It looks like the data structure and
> protocol exchange are incredibly similar after they threw away all the
> revlog stuff.

I think they also added more pointers and data caches that don't
exist in Hg normally, but exist in their BigTable backend.  Like
precomputing pointers from a commit to the most recent ancestor
that is a merge; I think that was mentioned in the talk.
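
A rough sketch of what such a precomputed pointer could look like.
This is a guess at the idea, not their implementation: it assumes
the history is given as a dict of commit -> list of parents, and it
only follows first parents, which is my simplification:

```python
def nearest_merge_ancestors(parents):
    """Map each commit to its nearest merge ancestor, or None.

    `parents` maps commit id -> list of parent ids.  Only the first
    parent is followed (an assumption made for this sketch).
    """
    pointer = {}
    for start in parents:
        # Walk first-parent links until we reach an already-computed
        # commit or a root, remembering the commits we passed.
        chain = []
        commit = start
        while commit not in pointer:
            chain.append(commit)
            if not parents[commit]:
                break
            commit = parents[commit][0]
        # Fill in pointers from the oldest commit to the newest.
        for c in reversed(chain):
            if not parents[c]:
                pointer[c] = None
                continue
            first = parents[c][0]
            if len(parents[first]) > 1:
                pointer[c] = first          # parent is itself a merge
            else:
                pointer[c] = pointer[first]  # inherit parent's pointer
    return pointer
```

With the pointers stored next to each commit, a walker can jump over
a long linear run of history in one lookup instead of fetching every
commit along the way.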

The JGit/MINA based servers run git "well enough", but that's off
local disk, and we do pay a good price compared to C Git.  E.g.
we really need a revcache to accelerate the object enumeration phase,
which takes ages in JGit.  And indexing a pushed pack is rather slow
compared to C Git; a large push could take up to a minute or two
to fully index and fsck.

> Or is it just that they're fine with the speed loss and
> the Android project would not be?

What does Android have to do with Hg?  Android went with Git for
a lot of reasons, none of them having to do with the performance
or availability of Hg on code.google.com.  All of them had to do
with Git being a really solid DVCS that has a very bright future.

-- 
Shawn.