On 6/6/2022 8:33 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> +Cloning with Bundle URIs
>> +------------------------
>> +
>> +The primary need for bundle URIs is to speed up clones. The Git client
>> +will interact with bundle URIs according to the following flow:
>> +
>> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
>> +   option _or_ the client discovers a bundle list advertised by the
>> +   Git server.
>> +
>> +2. If the downloaded data from a bundle URI is a bundle, then the client
>> +   inspects the bundle headers to check that the negative commit OIDs are
>
> Although "negative" would be understandable to pros, the commits
> required to unbundle a bundle file are officially called
> "prerequisite commits" (cf. "git bundle --help"), so that may be
> easier to understand by ordinary readers.

Ok. I can work to replace this language throughout.

>> +   present in the client repository. If some are missing, then the client
>> +   delays unbundling until other bundles have been unbundled, making those
>> +   OIDs present. When all required OIDs are present, the client unbundles
>> +   that data using a refspec. The default refspec is
>> +   `+refs/heads/*:refs/bundles/*`, but this can be configured.
>
> The refs/bundles/ appear in the document only here, and it is
> unclear why we even want it (I am assuming this is against gc while
> "git clone" is still running) or how we are going to retire it, if
> ever. If there are multiple bundle files involved in this "git clone",
> to anchor objects that are necessary against "gc", don't we need to use
> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
> number locally?

The real reason to keep them in refs/bundles/ is that those refs can
be used in the incremental 'git fetch' after downloading the bundles
(in perpetuity) while not stomping on refs/heads/ or refs/remotes/.

>> +3. If the file is instead a bundle list, then the client inspects the
>> +   `bundle.list.mode` to see if the list is of the `all` or `any` form.
>
> If the downloaded file is not a bundle (e.g. "git bundle list-heads"
> barfs on it) and it is not parseable with our configuration parser,
> do we error out, or do we pretend as if that bundle file or the TOC
> did not exist (if the bundle list with mode=any at the higher level
> has appropriate alternatives)?

I think the best thing to do would be to fail as gracefully as
possible here. In particular: give a warning that the remote did not
give anything helpful, but continue with the normal fetching from the
remote without help from bundles.

>> +  a. If `bundle.list.mode=all`, then the client considers all bundle
>> +     URIs. The list is reduced based on the `bundle.<id>.filter` options
>> +     matching the client repository's partial clone filter.
>
> OK, this answers my earlier question nicely. It probably means that
> either the presentation order needs a bit of rethinking, or "we
> group by .filter" needs to be mentioned a lot earlier.

Ok. It may even need to be _implemented_ a lot earlier, looking at my
current outline of future changes.

>> Then, all
>> +     bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
>> +     is provided, then the bundles are downloaded in reverse-
>> +     chronological order, stopping when a bundle has all required OIDs.
>
> Stop as soon as just one bundle has all the prerequisite objects, or
> should we keep going until all bundles have their prerequisites
> satisfied? I presume it is the latter.

True. We keep going until all have their required OIDs. With a
.requires chain of "1 requires 2" and "2 requires 3", it is possible
that "1" has a prerequisite commit that actually exists in "3", even
though bundle "2" can still unbundle in the client's repository. One
could argue that the "1" bundle should have a direct .requires
relationship on "3" (in addition to "2"), but for simplicity I think
it is OK to imply "transitive requires". Using "all" here should make
it clearer.

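To make that concrete, a mode=all list with a .requires chain could
look something like this (the IDs, URIs, and timestamps are all made
up, and the exact serialization is still subject to change):

    [bundle "list"]
        mode = all

    [bundle "3"]
        uri = https://bundles.example.com/base.bundle
        timestamp = 1650000000

    [bundle "2"]
        uri = https://bundles.example.com/april.bundle
        timestamp = 1652000000
        requires = 3

    [bundle "1"]
        uri = https://bundles.example.com/may.bundle
        timestamp = 1654000000
        requires = 2

The client downloads "1" first (newest timestamp), notices that some
prerequisite commits are missing, follows the .requires links through
"2" to "3", and then unbundles in chronological order: "3", then "2",
then "1". Even if one of "1"'s prerequisite commits only exists in
"3", the transitive chain eventually makes it present.
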
>> +     The bundles can then be unbundled in chronological order. The client
>> +     stores the latest timestamp as a heuristic for avoiding future
>> +     downloads if the bundle list does not advertise newer bundles.
>
> So we see a list, we start grabbing from new to old. Newer ones
> that are based on older ones may have dependencies, so we do not
> unbundle until we have all the prerequisites for them. The bundles
> that satisfy their prerequisites are unbundled---that would give us
> enough objects to play with. What happens to the refs recorded in
> them, though?

Those refs are translated into refs/bundles/. If multiple bundles
have the same refs/heads/ included, then the newest ones would
overwrite those refs in refs/bundles/.

> Is the timestamp per the serving host, or per the CDN host that
> serves us bundle files, or...? I guess it is premature to discuss it
> here. "git clone" bootstraps from the advertisement made only by a
> single serving host, so the single newest timestamp among the
> bundles used from the bundle list is what we store here. How that
> timestamp is used is primarily of interest in future fetching, which
> would be discussed later.

The bundle provider is responsible for making the timestamps behave
as an opaque, increasing token.

>> +Fetching with Bundle URIs
>> +-------------------------
>> +
>> +When the client fetches new data, it can decide to fetch from bundle
>> +servers before fetching from the origin remote. This could be done via a
>> +command-line option, but it is more likely useful to use a config value
>> +such as the one specified during the clone.
>> +
>> +The fetch operation follows the same procedure to download bundles from a
>> +bundle list (although we do _not_ want to use parallel downloads here). We
>> +expect that the process will end when all negative commit OIDs in a thin
>> +bundle are already in the object database.
>
> I do not see why we do not want to use parallel download, though.
> If our last bundle download was last month, and they have two newer
> bundles since then, don't we want to grab both at the same time?
> Wasn't that the point of recording the newest timestamp when "git
> clone" grabbed bundles?

In theory, we have also been fetching from the origin, so we might
have already received all of the objects in that second (or first!)
bundle. By going in reverse-chronological order, we minimize the
amount of data downloaded (assuming that is the most expensive part
of the operation).

This is something where we have room to experiment. With Ævar's idea
of downloading only the headers, we could download all headers in
parallel and halt the download for any bundles where we have all of
the ref tips.

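As a sketch of that "do we already have all of the ref tips?" check
against a downloaded bundle (the file name here is made up):

    # If every tip advertised by the bundle already exists in our
    # object database, there is nothing new to unbundle from it.
    git bundle list-heads recent.bundle |
    while read oid ref
    do
            git cat-file -e "$oid" || echo "missing $oid ($ref)"
    done
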
>> +Error Conditions
>> +----------------
>> +
>> +If the Git client discovers something unexpected while downloading
>> +information according to a bundle URI or the bundle list found at that
>> +location, then Git can ignore that data and continue as if it was not
>> +given a bundle URI. The remote Git server is the ultimate source of truth,
>> +not the bundle URI.
>> +
>> +Here are a few example error conditions:
>> +
>> +* The client fails to connect with a server at the given URI or a connection
>> +  is lost without any chance to recover.
>> +
>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
>> +  use the `credential.helper` to attempt authentication after the first
>> +  `401 Not Authorized` response, but a second such response is a failure.
>> +
>> +* The client receives data that is not parsable as a bundle or table of
>> +  contents.
>
> Is it an error if bundle.<id>.list and the contents disagree?

I think this can be flexible. It is not difficult to treat the .list
value as advisory, so we can ignore disagreements. If we start making
decisions that hinge on the value of .list, then we can start
treating it as an error.

Or: maybe the .list value is so "advisory" that it is useless, and I
should just drop it from the schema.

> It is fine to call the possibility other than "a bundle file" "table
> of contents", but then let's do so consistently throughout the document.
> When we explain bundle.<id>.list, we should not call the other
> possibility "list" but "table of contents", for example.

Sorry, I had intended to find-and-replace all of these instances, but
missed some that cross line breaks.

>> +* The bundle list describes a directed cycle in the
>> +  `bundle.<id>.requires` links.
>> +
>> +* A bundle includes a filter that does not match expectations.
>
> Does this refer to a mismatch between the filter recorded in a
> bundle and bundle.<id>.filter entry that described the bundle?

Yes. I can make that more explicit.

>> +* The client cannot unbundle the bundles because the negative commit OIDs
>> +  are not in the object database and there are no more
>> +  `bundle.<id>.requires` links to follow.
>
> Is a .requires link mandatory? In a mode=all table of contents, we
> should not have to have .requires at all. In the above description
> on how bundle files are downloaded and in what order in Clone and
> Fetch operations, I didn't see any mention of .requires at all, but
> I think there should be. For example, the timestamp heuristics may
> say the bundle A is the latest. In a mode=any table of contents,
> shouldn't bundles that contain prerequisite commits of the bundle A
> be pointed by A's .requires fields?

It could be considered useful for the .timestamp heuristic, or we
could infer the .requires value as "the next most-recent bundle" in
that heuristic. (So maybe we remove it from the schema.)

With "all" but no heuristic, we need everything, without knowing
where to "start" or how to minimize downloads.

>> +4. Allow the client to understand the `bundle.list.forFetch` configuration
>> +   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
>> +   bundle URI with `bundle.list.forFetch=true`, it configures the client
>> +   repository to check that bundle URI during later `git fetch <remote>`
>> +   commands.
>
> So bundle.list.forFetch is, unlike everything else we saw that
> looked like a configuration variable in this document, a
> configuration variable whose value is boolean?
>
> Ah, no. You mean the "git clone" sees a bundle URI, grabs it and
> sees a table of contents, and in it, finds "bundle.forFetch" is set
> to true? Then "git fetch <remote>" is configured to also use bundle
> URI?

Yes. The bundle provider is advertising "I'm working hard to help you
with your 'git fetches' after your initial clone."

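In other words, the table of contents could include

    [bundle "list"]
        forFetch = true
        mode = all

and the client, after a successful clone, would record where that
list lives (via some to-be-named config key; think of a hypothetical
`git config fetch.bundleURI <uri>`) so that later `git fetch <remote>`
calls know to check the bundle provider first.
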
> It is unclear to me (with the information given here so far), why we
> want this. Isn't this something the responder to "git fetch" can
> advertise over the wire? If we leave a permanent record in the
> resulting repository to do the bundle URI during 'fetch', wouldn't
> it become more cumbersome to cancel (iow "no, you no longer want to
> talk to me with bundle URI feature") from the server side?

I've tried very hard to make the bundle provider as independent of
the origin Git server as possible (including no relationship at all).
Even if the origin Git server knows about a bundle provider at a
given URI, it does not necessarily know if _at this moment_ the
provider is bundling for fetches.

>> +5. Allow clients to discover bundle URIs during `git fetch` and configure
>> +   a bundle URI for later fetches if `bundle.list.forFetch=true`.
>> +
>> +6. Implement the "inspect headers" heuristic to reduce data downloads when
>> +   the `bundle.<id>.timestamp` heuristic is not available.
>
> Sounds sensible, even though I do not offhand see why the "peek
> header and stop" is any less useful when the timestamp heuristic is
> available.

Yes, it is still helpful with the heuristic. I can remove that
phrasing. (This functionality is _required_ to help incremental
fetches when a heuristic is not provided.)

>> +A major downside to this mechanism is that the origin server needs to know
>> +_exactly_ what is in those packfiles, and the packfiles need to be available
>> +to the user for some time after the server has responded. This coupling
>> +between the origin and the packfile data is difficult to manage.
>
> Hmph. I strongly suspect that there are Googlers on the list who
> have been managing such JGit server installations. Has this
> "coupling" been difficult to manage for you guys in the real world?
>
>> +Further, this implementation is extremely hard to make work with fetches.
>
> IOW, they do this only for clones and not fetches?

The last time I spoke with Googlers about this subject, what I heard
from them was "Yes, this is hard to get right for incremental
fetches, but the benefit for clones is big enough that we are happy
with just that." If advancements have come about since then to work
with incremental fetches, then I'd love to hear about it.

>> +Related Work: GVFS Cache Servers
>> +--------------------------------
>> ...
>> +During a `git fetch`, a hook requests the prefetch endpoint using the
>> +most-recent timestamp from a previously-downloaded prefetch packfile.
>> +Only the list of packfiles with later timestamps are downloaded.
>
> That sounds quite straightforward. Do you envision that their
> incremental snapshot packfile chains can somehow be shared with the
> bundle URI implementations? Doesn't it make it more cumbersome that
> this proposal uses the bundles as the encapsulation format, rather
> than packfiles? As you are sending extra pieces of information on
> top of the payload in the form of table-of-contents already, I
> wonder if bundle.<id>.uri should point at a bare packfile (instead
> of a bundle), while multi-valued bundle.<id>.prerequisite give the
> prerequisite objects? The machinery that is already generating the
> prefetch packfiles already knows which packfile has what
> prerequisites in it, so it rather looks simpler if the solution did
> not involve bundles.

The prefetch packfiles could be replaced with bundle URIs, if
desired.

The reason bundles were not needed for the GVFS protocol was that all
other object data not in those prefetch packfiles is downloaded via
direct requests (one object per request or a batch of objects
requested from a list of OIDs) and not from an incremental fetch. The
VFS for Git clients only talk to the origin server for the ref
advertisement and talk to the cache servers for the objects necessary
to satisfy the client's Git command. (Also: the client pushes
directly to the origin server.)

So in this world, the bundle URIs could be used as a replacement for
downloading these prefetch packfiles (bundles with filter=blob:none),
but the bundled refs become useless to the client.

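For comparison, the prefetch flow is roughly this (the endpoint shape
is approximate, from memory of the GVFS protocol):

    # GVFS cache server: return only the prefetch packfiles with
    # timestamps newer than the one the client already has.
    GET /gvfs/prefetch?lastPackTimestamp=<seconds since epoch>

The bundle URI analogue would be to re-download the table of
contents, keep only the entries whose bundle.<id>.timestamp is newer
than the timestamp stored from the previous download, and grab those
bundles in reverse-chronological order.
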
Thanks,
-Stolee