Re: [PATCH v4 9/9] Documentation/config: add odb.<name>.promisorRemote

Jonathan Nieder <jrnieder@xxxxxxxxx> · Tue, 16 Oct 2018 10:43:04 -0700

Hi Christian,

On Tue, Sep 25, 2018, Christian Couder wrote:

> In the cover letter there is a "Discussion" section which is about
> this, but I agree that it might not be very clear.
>
> The main issue that this patch series tries to solve is that
> extensions.partialclone config option limits the partial clone and
> promisor features to only one remote. One related issue is that it
> also prevents to have other kind of promisor/partial clone/odb
> remotes. By other kind I mean remotes that would not necessarily be
> git repos, but that could store objects (that's where ODB, for Object
> DataBase, comes from) and could provide those objects to Git through a
> helper (or driver) script or program.

Thanks for this explanation.  While you were in the Bay Area for the
Google Summer of Code mentor summit, I took the opportunity to learn a
little more about it, which was very helpful to me.

The broader picture is that this is meant to make Git natively handle
large blobs in a nicer way.  The design in this series has a few
components:

 1. Teaching partial clone to attempt to fetch missing objects from
    multiple remotes instead of only one.  This is useful because you
    can have a server that is nearby and cheaper to serve from (some
    kind of local cache server) that you make requests to first before
    falling back to the canonical source of objects.

 2. Simplifying the protocol for fetching missing objects so that it
    can be satisfied by a lighter weight object storage system than
    a full Git server.  The ODB helpers introduced in this series are
    meant to speak such a simpler protocol since they are only used
    for one-off requests of a collection of missing objects instead of
    needing to understand refs, Git's negotiation, etc.

 3. (possibly, though not in this series) Making the criteria for what
    objects can be missing more aggressive, so that I can "git add"
    a large file and work with it using Git without even having a
    second copy of that object in my local object store.

For (2), I would like to see us improve the remote helper
infrastructure instead of introducing a new ODB helper.  Remote
helpers are already permitted to fetch some objects without listing
refs --- perhaps we will want to

 i. split listing refs to a separate capability, so that a remote
    helper can advertise that it doesn't support that.  (Alternatively
    the remote could advertise that it has no refs.)

 ii. use the "long-running process" mechanism to improve how Git
     communicates with a remote helper.

For (1), things get more tricky.  In an object store from a partial
clone today, we relax the ordinary "closure under reachability"
invariant but in a minor way.  We'll need to work out how this works
with multiple promisor remotes.

The idea today is that there are two kinds of packs: promisor packs
(from the promisor remote) and non-promisor packs.  Promisor packs are
allowed to have reachability edges (for example a tree->blob edge)
that point to a missing object, since the promisor remote has promised
that we will be able to access that object on demand.  Non-promisor
packs are also allowed to have reachability edges that point to a
missing object, as long as there is a reachability edge from an object
in a promisor pack to the same object (because of the same promise).
See "Handling Missing Objects" in Documentation/technical/partial-clone.txt
for more details.
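
To make the invariant concrete, here is a rough sketch in Python.  It
is only a toy model, not Git's actual code: the function name and the
dict-of-edges representation are made up for illustration.  Each pack
is modeled as a mapping from an object id to the ids it references.

```python
def find_unpromised_missing(promisor_objs, non_promisor_objs):
    """Return missing object ids that no promisor-pack object points to.

    Both arguments are toy stand-ins for packs: dicts mapping an object
    id to the list of object ids it references.
    """
    present = set(promisor_objs) | set(non_promisor_objs)
    # Anything referenced from an object in a promisor pack is promised.
    promised = {ref for refs in promisor_objs.values() for ref in refs}

    violations = set()
    for refs in non_promisor_objs.values():
        for ref in refs:
            # A non-promisor object may point to a missing object only
            # if a promisor-pack object points to that same object.
            if ref not in present and ref not in promised:
                violations.add(ref)
    return violations
```

With a promisor pack {"tree1": ["blobA"]} and a non-promisor pack
{"tree2": ["blobA", "blobB"]}, only blobB would be reported: blobA is
missing too, but it is covered by the promise.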

To prevent older versions of Git from being confused by partial clone
repositories, such repositories use the repositoryFormatVersion
mechanism:

	[core]
		repositoryFormatVersion = 1
	[extensions]
		partialClone = ...

If we change the invariant, we will need to use a new extensions.* key
to ensure that versions of Git that are not aware of the new invariant
do not operate on the repository.
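
As a simplified illustration of why a new key is enough, the check
can be modeled like this in Python.  This is a sketch, not Git's
implementation: KNOWN_EXTENSIONS and can_operate are hypothetical
names, and "newInvariant" in the note below is a made-up key.

```python
# Hypothetical: the extensions.* keys this imaginary Git version knows.
KNOWN_EXTENSIONS = {"partialclone"}


def can_operate(format_version, extensions):
    """Decide whether this version of Git may operate on a repository.

    format_version is core.repositoryFormatVersion; extensions is the
    list of key names found in the [extensions] section.
    """
    if format_version == 0:
        # In this simplified model, extensions.* is not consulted at all.
        return True
    if format_version == 1:
        # Any extension we do not understand means we must refuse.
        return all(name.lower() in KNOWN_EXTENSIONS for name in extensions)
    # Unknown (future) format versions are refused outright.
    return False
```

Here can_operate(1, ["partialClone"]) is true, while
can_operate(1, ["partialClone", "newInvariant"]) is false, which is
the behavior we would rely on when changing the invariant.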

A promisor pack is indicated by there being a .promisor file next to
the usual .pack file.  Currently the .promisor file is empty.  The
previous idea was that once we want more metadata (e.g. for the sake of
multiple promisor remotes), we could write it in that file.  For
example, remotes could be associated with a <promisor-id> and the
.promisor file could indicate which <promisor-id> has promised to serve
requests for objects reachable from objects in this pack.

That will complicate the object access code as well, since currently
we only find who has promised an object during "git fsck" and similar
operations.  During everyday access we do not care which promisor
pack caused the object to be promised, since there is only one promisor
remote to fetch from anyway.

So much for the current setup.  For (1), I believe you are proposing to
still have only one effective <promisor-id>, so it doesn't necessarily
require modifying the extensions.* configuration.  Instead, the idea is
that when trying to access an object, we would go through the following
steps in order:

 1. First, check the local object store. If it's there, we're done.
 2. Second, try alternates --- maybe the object is in one of those!
 3. Now, try promisor remotes, one at a time, in user-configured order.

In other words, I think that for (1) all we would need is a new
configuration

	[object]
		missingObjectRemote = local-cache-remote
		missingObjectRemote = origin

The semantics would be that when trying to access a promised object,
we attempt to fetch from these remotes one at a time, in the order
specified.  We could require that the remote named in
extensions.partialClone be one of the listed remotes, without having
to care where it shows up in the list.
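
A minimal sketch of those semantics in Python, under some assumptions:
object stores are modeled as plain dicts, "fetchers" maps a remote
name to a function returning the object or None, and a fetched object
is kept in the local store afterwards.  All names here are
hypothetical; object.missingObjectRemote is only the key proposed
above.

```python
def check_remotes(missing_object_remotes, partial_clone_remote):
    """Require the extensions.partialClone remote to be listed somewhere."""
    if partial_clone_remote not in missing_object_remotes:
        raise ValueError(
            "%s is not listed in object.missingObjectRemote"
            % partial_clone_remote
        )


def read_object(oid, local, alternates, remotes, fetchers):
    """Look up an object: local store, then alternates, then remotes."""
    if oid in local:
        return local[oid]
    for alt in alternates:
        if oid in alt:
            return alt[oid]
    # Try promisor remotes one at a time, in the configured order.
    for name in remotes:
        data = fetchers[name](oid)
        if data is not None:
            local[oid] = data  # assumption: keep what we fetched
            return data
    raise KeyError(oid)
```

With remotes = ["local-cache-remote", "origin"], an object that the
cache lacks is requested from the cache first and then from origin.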

That way, we get the benefit of (1) without having to change the
semantics of extensions.partialClone and without having to care about
the order of sections in the config.  What do you think?

Thanks,
Jonathan