Re: git over webdav: what can I do for improving http-push ?

On Thu, Jan 03, 2008 at 20:14:09 +0100, Grégoire Barbier wrote:
> Jan Hudec wrote:
[...]
>> It is what bzr and mercurial do, and I think it would be quite a good way to
>> go for cases like this.
> Ok, I will have to look at bzr and mercurial...

Bzr is quite far from git, design-wise, I fear. Mercurial might be a little
more interesting to study, but since it is written in Python and internally
somewhat file-oriented, I don't think it would be of much use.

You should start with upload, leaving the download direction to the dumb
machinery git currently uses.

[...]
>> I have also thought about optimizing download using CGI, but then I thought
>> that maybe there is a way to statically generate packs so that if the client
>> wants n revisions, the number of revisions it downloads is O(n) and the
>> number of packs it gets them from (and thus the number of round-trips) is
>> O(log(n)). Assuming the client always wants everything up to the tip, of
>> course. Now this is trivial with linear history (pack the first half, then
>> half of what's left, etc., gives a logarithmic number of packs and you
>> always download at most twice as much as you need), but it would be nice if
>> somebody found a way (even one that satisfies the conditions on average
>> only) to do this with non-linear history; it would be a very nice
>> improvement to the http download -- the native git server optimizes the
>> amount of data transferred very well, but at the cost of quite a heavy CPU
>> load on the server.
>>   
> Well... frankly I don't think I'm capable of such things.
> Writing a walker over webdav or a simple cgi is a thing I can do (I
> think), but I'm not tough enough (or not ready to take the time needed)
> to look into the internals of packing revisions (whereas I can imagine
> it would mean that "my" walker would only be suitable for small projects
> in terms of code size and commit frequency).

Well, it does not depend on the walker -- the walker is quite simple and
already written anyway.

> I had a quick look at bzr and hg, and it seems that bzr uses the easy way
> (walker, no optimizations)

That's not quite true -- bzr has both dumb (walker over plain HTTP) and smart
(CGI) methods. But their CGI is really just tunnelling their custom protocol
over HTTP, and that protocol will not be anywhere near what we want for git
because of the vastly different design of the storage.

> and hg uses a cgi (therefore, maybe optimizations).
> By quick look I mean that I sniffed the HTTP queries on the network during a
> clone. I need to look harder...

Yes, mercurial uses a CGI. But I am not sure how similar their approach is to
anything that would make sense for git, so looking at the details might or
might not be useful.

> BTW I never looked at the git:// protocol. Do you think that by tunneling 
> the git protocol in a cgi (hg uses URLs of the form 
> "/mycgi?cmd=mycommand&...", therefore I think "tunnel" is not a bad 
> word...) the performance would be good?

It would be pretty hard to tunnel it, and it would lose all its nice
properties. The git protocol, for pull, basically works like this:

 - the server sends a list of its refs
 - the client tells the server which ones it wants
 - the client starts listing revisions it has, newest to oldest
 - the server tells the client whenever it finds a common ancestor with one of
   the desired heads
 - the client restarts the listing from the next ref
 - the server starts sending the data when the client runs out of refs to list

The main point about the protocol is that the client keeps listing revisions
as fast as it can and the server stops it as soon as it sees a revision it
knows. Therefore only one round-trip is needed to discover each common
ancestor.
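
To make the shape of that exchange concrete, here is a tiny schematic model
in Python. It is not the real pkt-line wire format, and the names are made up
for the sketch; it only illustrates why one ACK per common ancestor is enough:

    # Schematic model of the negotiation above (NOT the real git wire format):
    # the client streams the revisions it has, newest to oldest, and the
    # server cuts it short as soon as it recognizes one of them.

    def find_cutoff(client_revs_newest_first, server_has):
        """Return the first revision the server already knows, i.e. the point
        below which nothing needs to be transferred for this ref."""
        for rev in client_revs_newest_first:   # client keeps sending "have"s...
            if rev in server_has:              # ...until the server ACKs one
                return rev                     # one ACK per common ancestor
        return None                            # no common history at all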

However, you can't do this over HTTP, because a response won't be started
until the whole request has been received. You could send a lot of smallish
requests and quick, often empty, responses to them, but that would waste a
lot of bandwidth (because of the HTTP overhead) and lose much of the speed
anyway. Also, HTTP is stateless while this exchange is inherently stateful,
so you would have to work around that somehow too. Therefore a different
approach is preferable over HTTP.

Now, to keep it stateless, I thought that:
 - the client would first ask for the list of refs
 - the client would then ask for a pack containing the first ref
 - the server would respond with a pack containing just that commit plus all
   objects that are not referenced by any of its parents
 - if the client does not have its parent, it would ask for a pack containing
   that
 - since it is the second request, the server would pack 2 revisions (with the
   necessary objects) this time
 - if the client still does not have all parents, it would again ask for a
   pack, receiving 4 revisions this time (then 8, 16, etc.)

This would guarantee that when you want n revisions, you make at most log2(n)
requests and download at most 2*n revisions (well, the requests are made for
each ref separately, so it is more like m*log2(n) where m is the number of
refs, but still). Additionally, it would be stateless, because the client
would simply say 'I want commit abcdef01 and this is my 3rd request' and the
server would provide that commit and 3 of its parents (i.e. 2^2 = 4 commits,
following the doubling above).
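
A minimal sketch of the server-side selection this doubling rule implies, in
Python; the `parents` map and the function name are assumptions standing in
for real repository access:

    # Sketch of the doubling scheme described above: request number k for
    # commit `tip` yields a pack built from at most 2**(k-1) commits.
    # The `parents` dict is a stand-in for reading real commit objects.

    from collections import deque

    def commits_for_request(tip, parents, request_no):
        """tip: the commit the client asked for.
        parents: dict mapping commit id -> list of parent ids.
        request_no: 1 for the first request for this ref, then 2, 3, ..."""
        budget = 2 ** (request_no - 1)          # 1, 2, 4, 8, ... commits
        seen, queue, picked = set(), deque([tip]), []
        while queue and len(picked) < budget:
            c = queue.popleft()
            if c in seen:
                continue
            seen.add(c)
            picked.append(c)                    # objects of c go into the pack
            queue.extend(parents.get(c, []))    # walk ancestors breadth-first
        return picked

The commits returned here (together with the trees and blobs they introduce)
are what would then be handed to pack-objects.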

Now, generating the packs takes its share of CPU. Servers like git.kernel.org
have quite a high CPU load. But in this scheme, all clients would most of the
time get the same packs (unlike with the native git protocol, where each
client gets a single pack with exactly what it needs). So the idea struck me
that the packs could simply be generated statically and fetched via the
existing dumb protocol. That would keep the efficiency and save a lot of CPU
power, which would allow serving a quite busy git repository from a limited
(and therefore cheap) virtual machine, or even (yes, I saw such an idea on
the list) serving a repository from an NSLU2.

Now, to create a policy that generates such packs, you don't actually need to
touch any C -- git-repack is still a shell script -- and you don't really
need to touch any internals either, because you only need to change how the
pack-objects command is called and leave all the dirty details to that
command.
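
For example, generating one such layer could look roughly like the sketch
below. Only git rev-list and git pack-objects are real commands here; the
choice of the two boundary commits is whatever the (hypothetical) policy
decides:

    # Hypothetical layering policy: pack everything reachable from newer_tip
    # but not from older_boundary, using only stock git plumbing.

    import subprocess

    def build_layer_pack(repo, newer_tip, older_boundary, dest_prefix):
        objs = subprocess.run(
            ["git", "rev-list", "--objects",
             newer_tip, "^" + older_boundary],
            cwd=repo, check=True, capture_output=True, text=True).stdout
        # pack-objects reads object names on stdin and writes
        # <dest_prefix>-<sha>.pack (plus its .idx) to disk.
        result = subprocess.run(
            ["git", "pack-objects", dest_prefix],
            cwd=repo, input=objs, check=True,
            capture_output=True, text=True)
        return result.stdout.strip()    # pack-objects prints the pack's hash

    # e.g. build_layer_pack(".", "v2.0", "v1.0", ".git/objects/pack/layer")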

I would personally not re-split already generated packs, only find some
algorithm for deciding when packs are deep enough in history that they should
be merged together. It also might not make sense to ever pack loose objects
into more than one pack -- a dumb-HTTP-served repository might have a
requirement of running this kind of repack after every push, and clients will
rarely have only part of a single push to the server.
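
Just as an illustration (this rule is entirely made up, not something git
does), a "deep enough to merge" policy could try to keep the pack sizes
roughly geometric, so that at most a logarithmic number of packs ever exists:

    # Made-up illustration of a merge rule: merge the newest run of packs
    # whenever a pack is not at least twice as big as the one before it,
    # which keeps the pack sizes roughly geometric (and their count small).

    def packs_to_merge(pack_sizes_newest_first):
        """Return indices of the newest packs that should be merged into one,
        or an empty list if the layering already looks geometric."""
        run = [0]
        for i in range(1, len(pack_sizes_newest_first)):
            if pack_sizes_newest_first[i] < 2 * pack_sizes_newest_first[i - 1]:
                run.append(i)     # too small to stand on its own, fold it in
            else:
                break             # from here on the sizes grow fast enough
        return run if len(run) > 1 else []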

> Maybe it's not that hard to write a performant HTTP/CGI protocol for Git if 
> it's based upon existing code such as the git protocol.

For push it might or might not be easy. But in the worst case you should be
able to compute the pack to upload locally (fetching from the server
beforehand if necessary), upload that pack (or a bundle), and update all the
refs.
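
A rough sketch of that worst-case fallback in Python; the WebDAV URL layout
and the assumption that something on the server unbundles the upload are made
up, only git bundle itself is real:

    # Worst-case push fallback: bundle what the server is missing and PUT it
    # over WebDAV. The server-side unbundling is an assumption, not an
    # existing git facility.

    import subprocess
    import urllib.request

    def push_as_bundle(repo, branch, server_tip, url):
        """branch: local ref to push, e.g. 'master'.
        server_tip: commit the server is already known to have on that ref."""
        bundle = "/tmp/push.bundle"
        subprocess.run(
            ["git", "bundle", "create", bundle,
             "%s..%s" % (server_tip, branch)],   # only what the server lacks
            cwd=repo, check=True)
        with open(bundle, "rb") as f:
            req = urllib.request.Request(url, data=f.read(), method="PUT")
            urllib.request.urlopen(req)          # ref updates still have to
                                                 # happen on the server side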

For pull it certainly won't be easy. You might be able to reimplement the
common ref discovery using some kind of gradually growing ref list and then
have the server generate a bundle, but optimizing the dumb protocol seems
more useful to me. As I said, generating the packs only requires devising a
way of selecting which objects should go together; git pack-objects will take
care of the dirty details of generating the packs, and git update-server-info
will take care of the dirty details of presenting the list of packs to the
client.

-- 
						 Jan 'Bulb' Hudec <bulb@xxxxxx>
