I had a few questions and suggestions; below.
On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a
`200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
I wonder if it may be worth considering adding an optional server
requirement ("MAY" not "MUST") to provide a `Content-Type` header
indicating if the response is a bundle list (or bundle) to skip straight
to parsing the correct type of file? Eg "application/x-git-bundle-list"?
Even the simplest of content servers should be able to set Content-Type.
If not falling back to 'try parse bundle else try parse bundle list' is
still OK.
+bundle.mode::
+ (Required) This value has one of two values: `all` and `any`. When `all`
+ is specified, then the client should expect to need all of the listed
+ bundle URIs that match their repository's requirements. When `any` is
+ specified, then the client should expect that any one of the bundle URIs
+ that match their repository's requirements will suffice. Typically, the
+ `any` option is used to list a number of different bundle servers
+ located in different geographies.
Do you forsee any future where we'd want or need to specify 'sets' of
bundles where "all" _of_ "any" particular set would be required?
Eg. there are 3 sets of bundles (A, B, C), and the client would need to
download all bundles belonging to any of A, B, or C? Where ABC would be
different geo-distributed sets?
I guess what I'm getting at here is with this design (which I appreciate
is intentionally flexible), there are several different ways a server
could direct a client to bundles stored in a nearby geography:
1. Serve an "all" bundle list with geo-located bundle URIs?
2. Serve a global "any" bundle list with each single bundle in a
different geography (specified by `location`)
3. Serve a single bundle (not a list) with a different
Are any of these going to be preferred over another for potential client
optimisations?
+
+bundle.heuristic::
+ If this string-valued key exists, then the bundle list is designed to
+ work well with incremental `git fetch` commands. The heuristic signals
+ that there are additional keys available for each bundle that help
+ determine which subset of bundles the client should download.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?
+bundle.<id>.location::
+ This string value advertises a real-world location from where the bundle
+ URI is served. This can be used to present the user with an option for
+ which bundle URI to use or simply as an informative indicator of which
+ bundle URI was selected by Git. This is only valuable when
+ `bundle.mode` is `any`.
I assume `location` is just an opaque string that is just used for info
or display purposes? Does it make sense for other 'display' type strings
like 'name' or 'message'?
+Here is an example bundle list using the Git config format:
+
+```
+[bundle]
+ version = 1
+ mode = all
+ heuristic = creationToken
+
+[bundle "2022-02-09-1644442601-daily"]
+ uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+ timestamp = 1644442601
+
+[bundle "2022-02-02-1643842562"]
+ uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+ timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+ uri = 2022-02-09-1644442631-daily-blobless.bundle
+ timestamp = 1644442631
+ filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+ uri = /git/git/2022-02-02-1643842568-blobless.bundle
+ timestamp = 1643842568
+ filter = blob:none
+```
Do you mean to use `creationToken` in these examples rather than
`timestamp`?
+Advertising Bundle URIs
+-----------------------
+
...
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
Would it make sense for the client to pick the first bundle URI rather
than an arbitrary one? The server could use information about the
request (origin IP/geography) to provide a sorted list of URIs by
physical distance to the client.
I guess if arbitrary is 'random' then this provides some client-side
load balancing over multiple potential servers too. Interested in your
thoughts behind what would be 'best practice' for a bundle server here.
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
Would this default behaviour also be considered another explicit
heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
Is it ever likely that the default behaviour would change?
+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+ `401 Not Authorized`, or `500 Internal Server Error`). The client should
+ use the `credential.helper` to attempt authentication after the first
+ `401 Not Authorized` response, but a second such response is a failure.
I'd probably say a 500 is not solvable with a different set of
credentials, but potentially a retry or just `die`. Do we attempt to
`credential_fill()` with anything other than a 401 (or maybe 403 or 404)
elsewhere in Git?
+* A bundle download during a `git fetch` contains objects already in the
+ object database. This is probably unavoidable if we are using bundles
+ for fetches, since the client will almost always be slightly ahead of
+ the bundle servers after performing its "catch-up" fetch to the remote
+ server. This extra work is most wasteful when the client is fetching
+ much more frequently than the server is computing bundles, such as if
+ the client is using hourly prefetches with background maintenance, but
+ the server is computing bundles weekly. For this reason, the client
+ should not use bundle URIs for fetch unless the server has explicitly
+ recommended it through the `bundle.flags = forFetch` value.
`bundle.flags` is not mentioned elsewhere in this document. Might be
worth including this, and possible values, with the other key
definitions above.
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+ This will include a new `git fetch --bundle-uri` mode for use as the
+ implementation underneath `git clone`. The initial version here will
+ expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+ update the `git fetch --bundle-uri` logic to properly distinguish
+ between `bundle.mode` options. Specifically design the feature so
+ that the config format parsing feeds a list of key-value pairs into the
+ bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
+ bundle URIs using the key-value pairs. Plug into the existing key-value
+ input to the bundle list logic. Allow `git clone` to discover these
+ bundle URIs and bootstrap the client repository from the bundle data.
+ (This choice is an opt-in via a config option and a command-line
+ option.)
+
+4. Allow the client to understand the `bundle.flag=forFetch` configuration
+ and the `bundle.<id>.creationToken` heuristic. When `git clone`
+ discovers a bundle URI with `bundle.flag=forFetch`, it configures the
+ client repository to check that bundle URI during later `git fetch <remote>`
+ commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+ a bundle URI for later fetches if `bundle.flag=forFetch`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+ the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
This plan seems logical to me at least! :-)
--
Thanks,
Matthew