Re: Resumable clone

On Tue, Mar 8, 2016 at 10:33 AM, Kevin Wern <kevin.m.wern@xxxxxxxxx> wrote:
> Hey Junio and Duy,
>
> Thank you for your thorough responses! I'm new to git dev, so it's
> extremely helpful.
>
>> - The server side endpoint does not have to be, and I think it
>> should not be, implemented as an extension to the current
>> upload-pack protocol. It is perfectly fine to add a new "git
>> prime-clone" program next to existing "git upload-pack" and
>> "git receive-pack" programs and drive it through the
>> git-daemon, curl remote helper, and direct execution over ssh.
>
> I'd like to work on this, and continue through to implementing the
> prime_clone() client-side function.

Great! Although I think you started with the most configurable part;
there's a lot to work out there, I guess.

> From what I understand, a pattern exists in clone to download a
> packfile when a desired object isn't found as a resource. In this
> case, if no alternative is listed in http-alternatives, the client
> automatically checks the pack index(es) to see which packfile contains
> the object it needs.

I don't follow this. What is "a resource"?

> However, the above is a fallback. What I believe *doesn't* exist is a
> way for the server to say, "I have a resource, in this case a
> full-history packfile

ah a resource could be the pack file to be downloaded, ok..

> , and I *prefer* you get that file instead of
> attempting to traverse the object tree." This should be implemented in
> a way that is extensible to other resource types moving forward.
>
> I'm not sure how the server should determine the returned resource. A
> packfile alone does not guarantee the full repo history

That's for the later part. At this point, I think the updated "git
clone" will contact the new service you're writing and ask "do you
have a resumable pack I can download?", and the service can return a
URL. Then prime_clone() proceeds to download it and figure out what's
in that pack.

Yeah, the determination is tricky; it depends on server setup. Let's
start with selecting the pack for download, because there could be
many of them. A heuristic (*) of choosing the biggest one in
$GIT_DIR/objects/pack is probably OK for now (we don't need full
history; "the biggest part of history" is good enough). Then we get
the pack file name, which can be used as a pack ID.
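The heuristic could be sketched like this (illustrative shell only, not an existing git facility; a real implementation would live in C):

```shell
# Pick the biggest pack in $GIT_DIR/objects/pack as the pack to offer.
# ls -S sorts by size, largest first.
GIT_DIR=${GIT_DIR:-.git}
biggest_pack=$(ls -S "$GIT_DIR"/objects/pack/pack-*.pack 2>/dev/null | head -n 1)
# The bare file name, e.g. pack-<sha1>.pack, doubles as the pack ID.
pack_id=$(basename "$biggest_pack")
echo "$pack_id"
```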

For the simplest setup, I suppose the admin would give us a URL
prefix (or multiple prefixes), e.g. http://myserver.com/cache-here/,
and we are supposed to append the pack file name to it, so the full
URL would be http://myserver.com/cache-here/pack-$SHA1.pack. This is
what the new service will return to git-clone.
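The prefix-based construction is plain string concatenation; a tiny sketch with made-up values:

```shell
# Build the full download URL from an admin-configured prefix.
# Both values here are placeholders, not real configuration.
prefix=http://myserver.com/cache-here/
pack_name=pack-1234abcd.pack
url="$prefix$pack_name"
echo "$url"
```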

For a more complex setup, I guess the admin can provide a script that
takes the pack ID as a key and returns the list of URLs for us. They
can give us the path to this script via a config file.
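Such an admin script might look like this (entirely hypothetical; the contract is just "pack ID in as $1, one candidate URL out per line", and the hosts are placeholders):

```shell
#!/bin/sh
# Hypothetical admin-provided hook: map a pack ID to download URLs.
pack_id=$1
echo "http://myserver.com/cache-here/$pack_id"
echo "http://mirror.example.com/cache-here/$pack_id"
```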

(*) The producer of this cached pack (and maybe the one sending it to
a CDN) is git-repack. But when and how that is done is really up to
the admins. So if we want to avoid heuristics, the admin really needs
to provide us a script or something that reports this info back. Such
a script can even choose to ignore the given pack ID and output URLs
based on repository identity only.

> , and I'm not
> positive checking the idx file for HEAD's commit hash ensures every
> sub-object is in that file (though I feel it should, because it is
> delta-compressed). With that in mind, my best guess at the server
> logic for packfiles is something like:
>
> Do I have a full history packfile, and am I configured to return one?
> - If yes, then return an answer specifying the file url and type (packfile)
> - Otherwise, return some other answer indicating the client must go
> through the original cloning process (or possibly return a different
> kind of file and type, once we expand that capability)

Well, the absence of this new service should be enough for git-clone
to fall back to the normal cloning protocol. The admin must enable
this service in git-daemon first if they want to use it. If there's no
suitable URL to offer, it's OK to just disconnect; git-clone must be
able to deal with that and fall back.

> Which leaves me with questions on how to test the above condition. Is
> there an expected place, such as config, where the user will specify
> the type of alternate resource, and should we assume some default if
> it isn't specified? Can the user optionally specify the exact file to
> use (I can't see why because it only invites more errors)? Should the
> specification of this option change git's behavior on update, such as
> making sure the full history is compressed? Does the existence of the
> HEAD object in the packfile ensure the repo's entire history is
> contained in that file?

I think some of these questions are basically "ask the admins"; the
other half we can deal with when implementing prime_clone().

Following up on what I wrote above: suppose your service's name is
clone-download (or any other name). The config variable
daemon.clonedownload must be set in order to turn this service on;
without it, git-clone falls back to a normal clone. Then we could have
either a config variable clonedownload.prefix or clonedownload.script,
or something like that.
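Put together, the server-side configuration might look like this config-file fragment (all variable names here are the sketch above, nothing that exists in git today):

```
[daemon]
	clonedownload = true
[clonedownload]
	# simplest setup: URL prefix to prepend to the pack file name
	prefix = http://myserver.com/cache-here/
	# or, for the scripted setup: path to an admin-provided hook
	script = /usr/local/bin/clone-download-urls
```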

Also, we made a mistake before with the transport protocol, where the
server talks first. I'm thinking that this time maybe we do it
differently: git-clone connects to the service, then tells the server
about its capabilities and whatever else (e.g. physical location, the
max number of URLs it wants to receive...). The server waits for that
first, then it can send the URLs back to the client and disconnect.

Unless there's something very special, I think we just follow the
current protocol and use pkt-line for communication (see "pkt-line
Format" in Documentation/technical/pack-protocol.txt). We send one URL
per pkt-line, terminated with a flush-pkt.
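For illustration, that framing could be produced like this (a shell sketch of pkt-line encoding, not code from git; the URL is a placeholder):

```shell
# Emit one pkt-line per URL: a 4-hex-digit length counting the 4-byte
# length header itself, the payload, and the trailing LF; then end the
# stream with the flush-pkt "0000".
pkt_line() {
	printf '%04x%s\n' $((4 + ${#1} + 1)) "$1"
}
pkt_line "http://myserver.com/cache-here/pack-1234.pack"
printf '0000'
```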

All this is about full-duplex connections (git:// or ssh://); smart
HTTP is not covered here. But I think a simple GET request is enough
for that (you'll have to touch http.c, but that can wait).

> Also, for now I'm assuming the same options should be available for
> prime-clone as are available for upload-pack (--strict,
> --timeout=<n>). Let me know if any other features are necessary.
> Also, let me know if I'm headed in the complete wrong direction...

Heh.. I didn't know upload-pack took --timeout :D --strict should be
there (repository discovery is the same everywhere). Not so sure about
--timeout (upload-pack can take a long time, so a timeout gives more
resource control; this new service should be instant).
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


