On Wed, Nov 2, 2011 at 16:27, Jeff King <peff@xxxxxxxx> wrote:
> On Wed, Nov 02, 2011 at 03:41:36PM -0700, Junio C Hamano wrote:
>> Jeff King <peff@xxxxxxxx> writes:
>>
>> > Which is all a roundabout way of saying that the git protocol is really
>> > the sane way to do efficient transfers. An alternative, much simpler
>> > scheme would be for the server to just say:
>> >
>> >   - if you have nothing, then prime with URL http://host/bundle
>> >
>> > And then _only_ clone would bother with checking mirrors. People doing
>> > fetch would be expected to do it often enough that not being resumable
>> > isn't a big deal.
>>
>> I think that is a sensible place to start.

Yup, I agree. The "repo" tool used by Android does this in Python right
now [1]. It's a simple hack: if the protocol is HTTP or HTTPS, the
client first tries to download $URL/clone.bundle. My servers have rules
that trap on */clone.bundle and issue an HTTP 302 Found response to
redirect the client to a CDN. Works. :-)

[1] http://code.google.com/p/git-repo/source/detail?r=f322b9abb4cadc67b991baf6ba1b9f2fbd5d7812&name=stable

> OK. That had been my original intent, but somebody (you?) mentioned the
> "if you have X" thing at the GitTogether, which got me thinking.
>
> I don't mind starting slow, as long as we don't paint ourselves into a
> corner for future expansion. I'll try to design the data format for
> specifying the mirror locations with that extension in mind.

Right. Aside from the fact that $URL/clone.bundle is perhaps a bad way
to decide on the URL to actually fetch (and isn't supportable over
git:// or ssh://)... we should start with the clone case and worry
about incremental updates later.

> Even if the bundle thing ends up too wasteful, it may still be useful to
> offer a "if you don't have X, go see Y" type of mirror when "Y" is
> something efficient, like git:// at a faster host (i.e., the "I built 3
> commits on top of Linus" case).

Actually, I really think the bundle thing is wasteful. It's a ton of
additional disk. Hosts like kernel.org want to use sendfile() when
possible to handle bulk transfers; git:// is not efficient for them
because we don't have sendfile() capability. It's also expensive for
kernel.org to store each Git repository twice on disk. The disk itself
is cheap; it's the kernel buffer cache that is damned expensive.

Assume for a minute that Linus' kernel repository is a popular thing to
access. If 400M of that history is available in a normal pack file on
disk, and the same 400M is available again as a "clone bundle thingy",
kernel.org now has to eat 800M of disk buffer cache for that one Git
repository, because both of those files are going to be hot.

I think I messed up by having "repo" use a Git bundle file as its data
source. What we should have used was a bog-standard pack file. Then the
client can download the pack file into the .git/objects/pack directory
and just generate the index, reusing the entire dumb protocol transport
logic. It also lets the server hand out the same file it retains for
the repository itself, keeping the disk buffer cache at only 400M for
Linus' repository.

> Agreed. I was really trying to avoid protocol extensions, though, at
> least for an initial version. I'd like to see how far we can get doing
> the simplest thing.

One (maybe dumb) idea I had was making the $GIT_DIR/objects/info/packs
file contain additional lines listing the reference tips at the time
each pack was made. The client just needs the SHA-1s; it doesn't
necessarily need the branch names themselves.
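For example, the extension could be as small as a new line type next to
the existing "P" lines. The "T" lines and the SHA-1s below are invented
purely for illustration; only the "P" line exists in the current
format:

  P pack-6e8e3f04f38c4f4d0b1f9f950b21a10b78bd4db6.pack
  T 8b3f2a9c6d41e07a5c9e2f0d4b6a8c1e3f5a7b9d
  T 1c0d2e4f6a8b9c1d3e5f7a9b0c2d4e6f8a0b1c2d

A client that understands the new lines can build its dummy references
from them; one that doesn't would just skip unknown line types, so
nothing needs to break for older clients.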
A client could initialize itself by getting this set of references,
creating temporary dummy references at those SHA-1s, downloading the
corresponding pack file and indexing it, then resuming with a normal
fetch (a rough command-level sketch appears at the end of this note).
Then we wind up with a git:// or ssh:// protocol extension that enables
sendfile() on an entire pack, and that provides the matching
objects/info/packs data to help a client over git:// or ssh://
initialize off the existing pack files.

Obviously there is the existing security feature that over git:// or
ssh:// (or even smart HTTP), a deleted or rewound reference stops
exposing the content in the repository that isn't reachable from the
other reference tips. The repository owner / server administrator will
have to make a choice here: either the existing packs are not exposed
as available via sendfile() until after GC has rebuilt them around the
right content set, or they are exposed and the time to expunge/hide an
unreferenced object stretches out until the GC completes (rather than
being immediate after the reference updates).

But either way, I like the idea of coupling the "resumable pack
download" to the *existing* pack files, because this is easy to deal
with. If you do have a rewind/delete and need to expunge content,
users/administrators already know how to run `git gc --prune=now` to
accomplish a full erase. Adding another thing with bundle files
somewhere else that may or may not contain the data you want to erase,
and remembering to clean that up, is not a good idea.
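For concreteness, the client bootstrap described above could look
roughly like this with today's plumbing. The URL, pack name ($P), and
tip SHA-1 ($TIP) are placeholders, and the tip lines in
objects/info/packs are the hypothetical extension sketched earlier:

  git init linux && cd linux
  # Learn the pack name and the advertised tips:
  curl $URL/objects/info/packs
  # Resumable bulk download straight into the object store:
  curl -C - -o .git/objects/pack/pack-$P.pack \
      $URL/objects/pack/pack-$P.pack
  git index-pack .git/objects/pack/pack-$P.pack
  # One dummy ref per advertised tip keeps the history reachable:
  git update-ref refs/dummy/1 $TIP
  # Catch up on whatever happened after the pack was made:
  git fetch $URL '+refs/heads/*:refs/remotes/origin/*'
  git update-ref -d refs/dummy/1

The nice property is that only the initial pack transfer needs to be
resumable; once the pack is indexed, everything else is the normal
fetch machinery.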