On Tue, Mar 8, 2016 at 10:33 AM, Kevin Wern <kevin.m.wern@xxxxxxxxx> wrote:
> Hey Junio and Duy,
>
> Thank you for your thorough responses! I'm new to git dev, so it's
> extremely helpful.
>
>> - The server side endpoint does not have to be, and I think it
>>   should not be, implemented as an extension to the current
>>   upload-pack protocol. It is perfectly fine to add a new "git
>>   prime-clone" program next to existing "git upload-pack" and
>>   "git receive-pack" programs and drive it through the
>>   git-daemon, curl remote helper, and direct execution over ssh.
>
> I'd like to work on this, and continue through to implementing the
> prime_clone() client-side function.

Great! Although I think you started with the most configurable part;
something to work out, I guess.

> From what I understand, a pattern exists in clone to download a
> packfile when a desired object isn't found as a resource. In this
> case, if no alternative is listed in http-alternatives, the client
> automatically checks the pack index(es) to see which packfile
> contains the object it needs.

I don't follow this. What is "a resource"?

> However, the above is a fallback. What I believe *doesn't* exist is a
> way for the server to say, "I have a resource, in this case a
> full-history packfile

Ah, a resource could be the pack file to be downloaded, ok..

> , and I *prefer* you get that file instead of
> attempting to traverse the object tree." This should be implemented
> in a way that is extensible to other resource types moving forward.
>
> I'm not sure how the server should determine the returned resource. A
> packfile alone does not guarantee the full repo history

That's for the later part. At this point, I think the updated "git
clone" will request the new service you're writing and ask "do you
have a resumable pack I can download?", and it can return a URL. Then
prime_clone() proceeds to download that pack and figure out what's in
it.

Yeah, the determination is tricky; it depends on the server setup.
Let's start with selecting the pack for download first, because there
could be many of them. A heuristic (*) of choosing the biggest one in
$GIT_DIR/objects/pack is probably ok for now (we don't need full
history; "the biggest part of history" is good enough). Then we get
the pack file name, which can be used as a pack ID.

For the simplest setup, I suppose the admin would give us a URL prefix
(or multiple prefixes), e.g. http://myserver.com/cache-here/, and we
are supposed to append the pack file name to it, so the full URL would
be http://myserver.com/cache-here/pack-$SHA1.pack. This is what the
new service will return to git-clone.

For a more complex setup, I guess the admin can provide a script that
takes the pack ID as a key and returns the list of URLs for us. They
can give us the path to this script via a config file.

(*) The source producing this cached pack (and maybe sending it to a
CDN) is git-repack. But when it's done and how it's done is really up
to the admins. So the admin really needs to provide us a script or
something that passes this info back, if we want to avoid heuristics.
Such a script can even choose to ignore the given pack ID and output
URLs based on repository identity alone.
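To illustrate the interface I have in mind (the argument convention
and the one-URL-per-line output are just made up at this point), such
a script could be as simple as:

    #!/bin/sh
    # Hypothetical URL-mapping script: the new service would run it
    # with the pack ID (e.g. "pack-$SHA1") as its only argument and
    # relay whatever URLs it prints, one per line, to the client.
    pack_id=$1

    echo "http://myserver.com/cache-here/$pack_id.pack"
    echo "http://mirror.example.com/cache-here/$pack_id.pack"

    # A fancier script could ignore $pack_id entirely and answer
    # based on repository identity alone, e.g. pointing at a CDN.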
> , and I'm not
> positive checking the idx file for HEAD's commit hash ensures every
> sub-object is in that file (though I feel it should, because it is
> delta-compressed). With that in mind, my best guess at the server
> logic for packfiles is something like:
>
> Do I have a full history packfile, and am I configured to return one?
> - If yes, then return an answer specifying the file url and type
>   (packfile)
> - Otherwise, return some other answer indicating the client must go
>   through the original cloning process (or possibly return a
>   different kind of file and type, once we expand that capability)

Well, the lack of this new service should be enough for git-clone to
fall back to the normal cloning protocol. The admin must enable this
service in git-daemon first if they want to use it. If there's no
suitable URL to show, it's ok to just disconnect; git-clone must be
able to deal with that and fall back.

> Which leaves me with questions on how to test the above condition. Is
> there an expected place, such as config, where the user will specify
> the type of alternate resource, and should we assume some default if
> it isn't specified? Can the user optionally specify the exact file to
> use (I can't see why because it only invites more errors)? Should the
> specification of this option change git's behavior on update, such as
> making sure the full history is compressed? Does the existence of the
> HEAD object in the packfile ensure the repo's entire history is
> contained in that file?

I think some of these questions are basically "ask admins"; the other
half we can deal with when implementing prime_clone().

Following up on what I wrote above: suppose your service's name is
clone-download (or any other name). The config variable
daemon.clonedownload must be set in order to turn this service on;
without it, git-clone falls back to a normal clone. Then we could have
either the config clonedownload.prefix or clonedownload.script. Or
something like that.

Also, we made a mistake before with the transport protocol, where the
server talks first. I'm thinking this time maybe we do it differently:
git-clone connects to the service, then tells the server about its
capabilities and whatever else (e.g. physical location, the max number
of URLs it wants to receive...). The server waits for that first, then
it can send the URLs back to the client, then disconnect.

Unless there's something very special, I think we just follow the
current protocol and use pkt-line for communication (see "pkt-line
Format" in Documentation/technical/pack-protocol.txt). We send one URL
per pkt-line, terminated with a null pkt-line.

All this is about full duplex connections (git:// or ssh://); smart
http is not covered here. But I think a simple GET request is enough
for that (you'll have to touch http.c, but that can wait).

> Also, for now I'm assuming the same options should be available for
> prime-clone as are available for upload-pack (--strict,
> --timeout=<n>). Let me know if any other features are necessary.
> Also, let me know if I'm headed in the complete wrong direction...

Heh.. I didn't know upload-pack took --timeout :D --strict should be
there (repository discovery is the same everywhere). Not so sure about
--timeout (upload-pack can take a long time, so a timeout gives more
resource control; this new service should be instant).
--
Duy
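PS. To be concrete about the above (every name here is a placeholder,
nothing is decided yet): on the server, the admin would turn the
service on and pick one of the two setups in the config, e.g.

    [daemon]
        clonedownload = true
    [clonedownload]
        prefix = http://myserver.com/cache-here/
        # or, for the scripted setup:
        # script = /path/to/url-map-script

and the exchange on the wire could look roughly like this (pkt-line
length prefixes omitted for readability; the capability names are
made up):

    C: max-urls=2 location=eu
    C: 0000          (null pkt-line: client is done talking)
    S: http://myserver.com/cache-here/pack-$SHA1.pack
    S: http://mirror.example.com/cache-here/pack-$SHA1.pack
    S: 0000          (null pkt-line, then disconnect)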