Hi Christian,

On Tue, Sep 25, 2018, Christian Couder wrote:

> In the cover letter there is a "Discussion" section which is about
> this, but I agree that it might not be very clear.
>
> The main issue that this patch series tries to solve is that
> extensions.partialclone config option limits the partial clone and
> promisor features to only one remote. One related issue is that it
> also prevents to have other kind of promisor/partial clone/odb
> remotes. By other kind I mean remotes that would not necessarily be
> git repos, but that could store objects (that's where ODB, for Object
> DataBase, comes from) and could provide those objects to Git through a
> helper (or driver) script or program.

Thanks for this explanation. While you were in the Bay Area for the
Google Summer of Code mentor summit, I took the opportunity to learn a
little more, which was very helpful to me.

The broader picture is that this is meant to make Git natively handle
large blobs in a nicer way. The design in this series has a few
components:

 1. Teaching partial clone to attempt to fetch missing objects from
    multiple remotes instead of only one. This is useful because you
    can have a server that is nearby and cheaper to serve from (some
    kind of local cache server) that you make requests to first before
    falling back to the canonical source of objects.

 2. Simplifying the protocol for fetching missing objects so that it
    can be satisfied by a lighter weight object storage system than
    a full Git server. The ODB helpers introduced in this series are
    meant to speak such a simpler protocol since they are only used
    for one-off requests of a collection of missing objects instead of
    needing to understand refs, Git's negotiation, etc.

 3. (possibly, though not in this series) Making the criteria for what
    objects can be missing more aggressive, so that I can "git add"
    a large file and work with it using Git without even having a
    second copy of that object in my local object store.

For (2), I would like to see us improve the remote helper
infrastructure instead of introducing a new ODB helper. Remote
helpers are already permitted to fetch some objects without listing
refs --- perhaps we will want to

 i. split listing refs to a separate capability, so that a remote
    helper can advertise that it doesn't support that. (Alternatively
    the remote could advertise that it has no refs.)

 ii. use the "long-running process" mechanism to improve how Git
     communicates with a remote helper.

For (1), things get more tricky. In an object store from a partial
clone today, we relax the ordinary "closure under reachability"
invariant but in a minor way. We'll need to work out how this works
with multiple promisor remotes.

The idea today is that there are two kinds of packs: promisor packs
(from the promisor remote) and non-promisor packs. Promisor packs are
allowed to have reachability edges (for example a tree->blob edge)
that point to a missing object, since the promisor remote has
promised that we will be able to access that object on demand.
Non-promisor packs are also allowed to have reachability edges that
point to a missing object, as long as there is a reachability edge
from an object in a promisor pack to the same object (because of the
same promise). See "Handling Missing Objects" in
Documentation/technical/partial-clone.txt for more details.

To prevent older versions of Git from being confused by partial clone
repositories, they use the repositoryFormatVersion mechanism:

        [core]
                repositoryFormatVersion = 1
        [extensions]
                partialClone = ...

If we change the invariant, we will need to use a new extensions.* key
to ensure that versions of Git that are not aware of the new invariant
do not operate on the repository.

A promisor pack is indicated by there being a .promisor file next to
the usual .pack file. Currently the .promisor file is empty. The
previous idea was that once we want more metadata (e.g.
for the sake of multiple promisor remotes), we could write it in that
file. For example, remotes could be associated with a <promisor-id>
and the .promisor file could indicate which <promisor-id> has promised
to serve requests for objects reachable from objects in this pack.

That will complicate the object access code as well, since currently
we only find who has promised an object during "git fsck" and similar
operations. During everyday access we do not care which promisor pack
caused the object to be promised, since there is only one promisor
remote to fetch from anyway.

So much for the current setup. For (1), I believe you are proposing to
still have only one effective <promisor-id>, so it doesn't necessarily
require modifying the extensions.* configuration. Instead, the idea is
that when trying to access an object, we would follow a list of steps
in order:

 1. First, check the local object store. If it's there, we're done.
 2. Second, try alternates --- maybe the object is in one of those!
 3. Now, try promisor remotes, one at a time, in user-configured order.

In other words, I think that for (1) all we would need is a new
configuration

        [object]
                missingObjectRemote = local-cache-remote
                missingObjectRemote = origin

The semantics would be that when trying to access a promised object,
we attempt to fetch from these remotes one at a time, in the order
specified. We could require that the remote named in
extensions.partialClone be one of the listed remotes, without having
to care where it shows up in the list.

That way, we get the benefit of (1) without having to change the
semantics of extensions.partialClone and without having to care about
the order of sections in the config.

What do you think?

Thanks,
Jonathan
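
P.S. To make the lookup order above a bit more concrete, here is a
rough sketch in Python. It is only an illustration of the proposed
semantics, not Git's actual object access code: the names read_object
and check_config, and the dict-based object stores and (name, fetch)
remote pairs, are all made up for this example.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical stand-ins, not Git's real data structures:
# an object store maps object IDs to contents, and a remote is a
# (name, fetch) pair whose fetch returns None when it lacks the object.
ObjectStore = Dict[str, bytes]
Remote = Tuple[str, Callable[[str], Optional[bytes]]]


def check_config(missing_object_remotes: Sequence[str], partial_clone: str) -> None:
    """Require the extensions.partialClone remote to be listed, at any position."""
    if partial_clone not in missing_object_remotes:
        raise ValueError(f"{partial_clone!r} is not listed in object.missingObjectRemote")


def read_object(
    oid: str,
    local: ObjectStore,
    alternates: Sequence[ObjectStore],
    remotes: Sequence[Remote],
) -> Tuple[str, bytes]:
    """Return (source, contents) for oid, trying sources in the proposed order."""
    # 1. The local object store.
    if oid in local:
        return "local", local[oid]
    # 2. Alternates.
    for alternate in alternates:
        if oid in alternate:
            return "alternate", alternate[oid]
    # 3. Promisor remotes, one at a time, in user-configured order.
    for name, fetch in remotes:
        contents = fetch(oid)
        if contents is not None:
            return name, contents
    raise KeyError(oid)
```

With remotes listed as local-cache-remote followed by origin, origin
is only contacted when the cache does not have the object, which is
the fallback behavior described in (1).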