Re: Mercurial on BigTable

Andreas Ericsson <ae@xxxxxx> · Thu, 11 Jun 2009 04:02:45 +0200

Scott Chacon wrote:
> Has anyone watched this yet?
>
> http://code.google.com/events/io/sessions/MercurialBigTable.html
>
> It's kind of interesting - a Googler talks about getting Mercurial
> running on BigTable.  What fascinates me is that if I'm not horribly
> mistaken, it seems like they just threw out the revlog format entirely
> and just store the data in a key-value store as sort of a Git-like
> content addressable filesystem.

It does indeed seem like that, yes. Would have been fun to be there to
congratulate him on implementing something that's already existed for
about three years ;-)
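
For anyone who hasn't looked at how git names its objects, here is a minimal Python sketch of a content-addressable key-value store. The key derivation is git's (SHA-1 over a "<type> <size>\0" header plus the content). Everything else is an assumption for illustration: ContentStore is a made-up name, the dict stands in for whatever database sits underneath, and none of this is taken from the BigTable implementation shown in the talk.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store; a dict stands in for the key-value database."""

    def __init__(self):
        self._rows = {}

    def put(self, kind: str, data: bytes) -> str:
        # git hashes a small header ("<type> <size>\0") followed by the raw content
        header = f"{kind} {len(data)}".encode() + b"\0"
        key = hashlib.sha1(header + data).hexdigest()
        # Identical content always maps to the same key, so storing it twice is a no-op
        self._rows[key] = (kind, data)
        return key

    def get(self, key: str):
        return self._rows[key]
```

Since the key is derived from the content, the store needs no separate index to tell whether it already has an object.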

> I had thought they were taking
> advantage of the revlog structure somehow, but it appears like they
> basically just changed the underlying data format to be much more like
> Git and rewrote an Hg-speaking server on top of that.  They even
> explicitly store the head values like refs instead of reading
> childless nodes out of the revlog, which is what I thought Hg did.

Well, storing the head values as refs is the only thing that makes
sense if you're using a database to track things, since you'd otherwise
have to map in too much data to get any sort of performance at all
out of it.
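
A rough Python sketch of the difference, assuming commits are stored as rows mapping a commit id to its parent ids and refs as rows mapping a name to a commit id. Both function names and the toy data are hypothetical, not anything from Hg, git or the talk.

```python
def heads_by_scan(commits):
    """Find childless commits; this has to read every commit row."""
    has_child = set()
    for parents in commits.values():
        has_child.update(parents)
    return {cid for cid in commits if cid not in has_child}


def heads_by_refs(refs):
    """Read the heads straight from the stored refs; touches only the ref rows."""
    return set(refs.values())


# Toy history: a <- b, then two branches c and d growing from b
commits = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"]}
refs = {"default": "c", "feature": "d"}
```

Both calls give the same answer here, but the scan grows with the size of the history while the ref lookup grows only with the number of heads.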

> Does anyone know how they do the graph walking efficiently with this
> structure?  He mentioned it was about half as fast as native Hg, but
> that seemed to be acceptable.

Short answer: they don't. DAG walking means they have to look up several
changesets in a linear fashion, but if they don't know the order
up front they'll have to suffer the penalty of actually fetching
each commit from the BigTable database over the network. It would
be similar to storing git objects in a database on a different
host, which would also be quite a lot slower than just hitting an
mmap()'ed file in binary form.
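
To illustrate where the cost comes from, here is a small Python sketch of an ancestor walk where every commit lookup counts as one round trip. RemoteStore and walk_ancestors are hypothetical names, and the dict only pretends to be a database on another host. The parents of a commit are not known until that commit has been fetched, so the lookups cannot be batched up front.

```python
from collections import deque


class RemoteStore:
    """Dict pretending to be a remote database; counts one round trip per fetch."""

    def __init__(self, commits):
        self._commits = commits
        self.round_trips = 0

    def fetch(self, commit_id):
        self.round_trips += 1
        return self._commits[commit_id]


def walk_ancestors(store, head):
    """Breadth-first walk from head; returns commit ids in visiting order."""
    seen = set()
    order = []
    queue = deque([head])
    while queue:
        cid = queue.popleft()
        if cid in seen:
            continue  # already visited via another child, no extra fetch
        seen.add(cid)
        order.append(cid)
        # The parents are only known once this commit has been fetched
        queue.extend(store.fetch(cid))
    return order


# Toy history with a merge: e has parents c and d, which both descend from b
store = RemoteStore({"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["c", "d"]})
order = walk_ancestors(store, "e")
```

Walking this five-commit history costs five round trips, one per commit reached. Against a local mmap()'ed file the same loop pays a memory access per commit instead of network latency.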

> Curious if anyone had any thoughts or
> information on this.  Shawn, are there technical reasons why this
> works well the way they're doing it for Hg but would not for Git (like
> in the repo MINA based server)?  It looks like the data structure and
> protocol exchange are incredibly similar after they threw away all the
> revlog stuff.  Or is it just that they're fine with the speed loss and
> the Android project would not be?

I'm more curious as to why they didn't choose git. The only explanation
that was actually true is that hg works well over HTTP (if you can call
3 network requests per not-up-to-date head "well"). Since I can't imagine
them not doing proper research before launching a project that almost
certainly cost quite a lot of money, and I personally think that the
"http rules all" explanation sounded weak, I'm guessing there were other
reasons as to why they didn't go with git instead, and I'm fairly curious
to hear them. If I were to take a guess, I'd say git is written in a pretty
unfriendly way for implementing other storage engines.

Ah well. In a year or two they'll probably support git as well. One can
hope at least ;-)

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.