Re: git-daemon on NSLU2


On 8/24/07, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > I can clone the tree in five minutes using the http protocol. Using the
> > git protocol would take 24hrs if I let it finish.
>
> The http side doesn't actually do any global verification, the way
> git-daemon does. So to it, everything is just temporary buffers, and you
> don't need any memory at all, really.
>
> git-daemon will create a packfile. That means that it has to generate the
> *global* object reachability, and will then optimize the object packing
> etc etc. That's a minimum of something like 48 bytes per object for just
> the object chains, and the kernel has a *lot* of objects (over half a
> million).
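
(At 48 bytes each, half a million objects already works out to roughly
24MB just for the object chains, before any of the actual packing work.)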

This process creates a large, repetitive workload: you take a 200MB pack,
repack it just to add a few loose objects, and then throw the result away.
That model makes the NSLU2 unusable, but I also see it at my shared hosting
provider. An initial clone of a repo that takes 3min from kernel.org takes
25min on a shared host since the RAM is not dedicated.

There are three categories of fetches:
1) initial clone, fetch all
2) fetch recent
3) I haven't fetched in three months

99% of fetches fall in the first two categories.

A very simple solution is to sendfile() existing packs whenever they contain
any objects the client wants, and let the client sort out the unwanted
objects. Yes, this sends extra traffic over the net, but the only group
significantly impacted is #3, which is by far the least frequent one.
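
Roughly, the per-pack transfer on the server could be as dumb as the
untested sketch below. The caller would loop over objects/pack/*.pack and
call it only when the pack holds at least one object the client asked for;
that "does this pack contain a wanted object" check against the .idx is
hand-waved here, it is not an existing git function.

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Stream one existing packfile to the client socket, unmodified. */
static int send_whole_pack(int client_fd, const char *pack_path)
{
        struct stat st;
        off_t off = 0;
        int fd = open(pack_path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        /* let the kernel copy pack -> socket; no per-object work at all */
        while (off < st.st_size) {
                ssize_t n = sendfile(client_fd, fd, &off, st.st_size - off);
                if (n <= 0) {
                        close(fd);
                        return -1;
                }
        }
        close(fd);
        return 0;
}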

Loose objects are handled as they are now. To make this scheme work well,
you let loose objects build up on the server and periodically sweep only the
older ones into a pack. Packing the entire repo into a single pack would
force even the recent fetches (#2) to retrieve the whole thing.
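
Something like the untested sketch below could drive that sweep: walk the
loose object directories, print the names of objects older than a cutoff
(two weeks here, purely as an example), and pipe the list straight into
git-pack-objects. Afterwards git-prune-packed can drop the now-redundant
loose copies.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Print the SHA-1 names of loose objects older than `cutoff`. */
static void list_old_loose(const char *objdir, time_t cutoff)
{
        char sub[1024], path[1100];
        int i;

        for (i = 0; i < 256; i++) {
                DIR *d;
                struct dirent *de;

                snprintf(sub, sizeof(sub), "%s/%02x", objdir, i);
                d = opendir(sub);
                if (!d)
                        continue;
                while ((de = readdir(d)) != NULL) {
                        struct stat st;

                        if (strlen(de->d_name) != 38)  /* skips "." and ".." too */
                                continue;
                        snprintf(path, sizeof(path), "%s/%s", sub, de->d_name);
                        if (stat(path, &st) == 0 && st.st_mtime < cutoff)
                                printf("%02x%s\n", i, de->d_name);
                }
                closedir(d);
        }
}

int main(void)
{
        /* usage: ./list-old-loose | git pack-objects .git/objects/pack/pack */
        list_old_loose(".git/objects", time(NULL) - 14 * 24 * 3600);
        return 0;
}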

Initial clone can be optimized further by recognizing that the receiving
repository is empty and simply sending it everything; there is no need to
compute which objects are missing at the server. This speeds up initial
clone because the existing packs can be sent immediately instead of waiting
for a new pack file to be built. The loose-object pack can be built in
parallel with sending the existing packs.
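
The parallel part could look like the untested sketch below (the
"pack-loose" base name is made up, and error handling is dropped): fork off
pack-objects for whatever is not packed yet, stream the existing packs with
send_whole_pack() while it runs, then wait and send its output last.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Build a pack of the not-yet-packed objects in the background. */
static pid_t start_loose_pack(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                execl("/bin/sh", "sh", "-c",
                      "git rev-list --objects --all --unpacked"
                      " | git pack-objects .git/objects/pack/pack-loose",
                      (char *)NULL);
                _exit(127);
        }
        return pid;
}

/*
 * The clone path would then be roughly:
 *
 *      pid = start_loose_pack();
 *      for each existing objects/pack/pack-*.pack
 *              send_whole_pack(client_fd, path);
 *      waitpid(pid, &status, 0);
 *      send the freshly written pack-loose-*.pack last
 */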

I recognize that when cloning a single branch, or when using --reference,
too many objects will also be transmitted, but I believe the reduction in
server load outweighs the overhead of sending the extra objects in those
cases. You can always prune the extra objects on the client side.

On 8/24/07, Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> There was an idea to special-case clone (just concatenate the packs; the
> receiving side, as someone noted, can detect pack boundaries; do not forget
> to pack the loose objects first) instead of using the generic fetch --all
> for clone, but there is no code. Code speaks louder than words (although it
> would help if someone provided the details of pack boundary detection...)

Write the file name and length into the socket before sending each pack, and
use sendfile() (or whatever its current incarnation is) to actually send the
pack data. Insert one of these header lines between packs.
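
The receiving end of that framing is trivial; an untested sketch (error
handling and cleanup mostly skipped): read a header line, copy exactly that
many bytes into a temporary file, then let git-index-pack (an existing
command) rebuild the .idx for it, and repeat until EOF.

#include <stdio.h>

/* Consume a stream of "name length\n" headers followed by raw pack data. */
static int receive_packs(FILE *in)
{
        char header[1024];

        while (fgets(header, sizeof(header), in)) {
                char name[256], tmp[320];
                unsigned long len, done = 0;
                FILE *out;

                if (sscanf(header, "%255s %lu", name, &len) != 2)
                        return -1;
                snprintf(tmp, sizeof(tmp), "%s.part", name);
                out = fopen(tmp, "wb");
                if (!out)
                        return -1;
                while (done < len) {
                        char buf[8192];
                        size_t want = len - done < sizeof(buf)
                                    ? len - done : sizeof(buf);
                        size_t n = fread(buf, 1, want, in);
                        if (!n) {
                                fclose(out);
                                return -1;
                        }
                        fwrite(buf, 1, n, out);
                        done += n;
                }
                fclose(out);
                /* run "git index-pack" on tmp here, then mv it into place */
        }
        return 0;
}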

> In addition to the object chains themselves, the native protocol will also
> obviously have to actually *look* at and parse all the tree and commit
> objects while it does all this, so while it doesn't necessarily keep all
> of those in memory all the time, it will need to access them, and if you
> don't have enough memory to cache them, that will add its own set of IO.
>
> So I haven't checked exactly how much memory you really want to have to
> serve big projects, but with some handwavy guesstimate, if you actually
> want to do a good job I'd guess that you really want to have at least as
> much memory as the size of largest project you are serving, and probably
> add at least 10-20% on top of that.
>
> So for the kernel, at a guess, you'd probably want to have at least 256MB
> of RAM to do a half-way good job. 512MB is likely nicer and allows you to
> actually cache the stuff over multiple accesses.
>
> But I haven't actually tested. Maybe it might be bearable at 128M.
>
>                         Linus
>


-- 
Jon Smirl
jonsmirl@xxxxxxxxx
