Re: [PATCH v2 1/6] docs: document bundle URI standard

Matthew John Cheetham <mjcheetham@xxxxxxxxxxx> · Thu, 21 Jul 2022 13:09:33 +0100

I had a few questions and suggestions; below.

On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a 
`200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.

I wonder if it may be worth considering adding an optional server 
requirement ("MAY" not "MUST") to provide a `Content-Type` header 
indicating if the response is a bundle list (or bundle) to skip straight 
to parsing the correct type of file? Eg "application/x-git-bundle-list"?

Even the simplest of content servers should be able to set Content-Type. 
If not falling back to 'try parse bundle else try parse bundle list' is 
still OK.

+bundle.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.

Do you forsee any future where we'd want or need to specify 'sets' of 
bundles where "all" _of_ "any" particular set would be required?
Eg. there are 3 sets of bundles (A, B, C), and the client would need to 
download all bundles belonging to any of A, B, or C? Where ABC would be 
different geo-distributed sets?

I guess what I'm getting at here is with this design (which I appreciate 
is intentionally flexible), there are several different ways a server 
could direct a client to bundles stored in a nearby geography:

1. Serve an "all" bundle list with geo-located bundle URIs?
2. Serve a global "any" bundle list with each single bundle in a 
different geography (specified by `location`)
3. Serve a single bundle (not a list) with a different

Are any of these going to be preferred over another for potential client 
optimisations?

+
+bundle.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.

Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use or simply as an informative indicator of which
+	bundle URI was selected by Git. This is only valuable when
+	`bundle.mode` is `any`.

I assume `location` is just an opaque string that is just used for info 
or display purposes? Does it make sense for other 'display' type strings 
like 'name' or 'message'?

+Here is an example bundle list using the Git config format:
+
+```
+[bundle]
+	version = 1
+	mode = all
+	heuristic = creationToken
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```

Do you mean to use `creationToken` in these examples rather than 
`timestamp`?

+Advertising Bundle URIs
+-----------------------
+
...
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.

Would it make sense for the client to pick the first bundle URI rather 
than an arbitrary one? The server could use information about the 
request (origin IP/geography) to provide a sorted list of URIs by 
physical distance to the client.

I guess if arbitrary is 'random' then this provides some client-side 
load balancing over multiple potential servers too. Interested in your 
thoughts behind what would be 'best practice' for a bundle server here.

+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.

Would this default behaviour also be considered another explicit 
heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
Is it ever likely that the default behaviour would change?

+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+  `401 Not Authorized`, or `500 Internal Server Error`). The client should
+  use the `credential.helper` to attempt authentication after the first
+  `401 Not Authorized` response, but a second such response is a failure.

I'd probably say a 500 is not solvable with a different set of 
credentials, but potentially a retry or just `die`. Do we attempt to 
`credential_fill()` with anything other than a 401 (or maybe 403 or 404) 
elsewhere in Git?

+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through the `bundle.flags = forFetch` value.

`bundle.flags` is not mentioned elsewhere in this document. Might be 
worth including this, and possible values, with the other key 
definitions above.

+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include a new `git fetch --bundle-uri` mode for use as the
+   implementation underneath `git clone`. The initial version here will
+   expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+   update the `git fetch --bundle-uri` logic to properly distinguish
+   between `bundle.mode` options. Specifically design the feature so
+   that the config format parsing feeds a list of key-value pairs into the
+   bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
+   bundle URIs using the key-value pairs. Plug into the existing key-value
+   input to the bundle list logic. Allow `git clone` to discover these
+   bundle URIs and bootstrap the client repository from the bundle data.
+   (This choice is an opt-in via a config option and a command-line
+   option.)
+
+4. Allow the client to understand the `bundle.flag=forFetch` configuration
+   and the `bundle.<id>.creationToken` heuristic. When `git clone`
+   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
+   client repository to check that bundle URI during later `git fetch <remote>`
+   commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+   a bundle URI for later fetches if `bundle.flag=forFetch`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+   the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.

This plan seems logical to me at least! :-)

--
Thanks,
Matthew