On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@xxxxxxxxxx>
>
> The previous change introduced the bundle URI design document. It
> creates a flexible set of options that allow bundle providers many ways
> to organize Git object data and speed up clones and fetches. It is
> particularly important that we have flexibility so we can apply future
> advancements as new ideas for efficiently organizing Git data are
> discovered.
>
> However, the design document does not provide even an example of how
> bundles could be organized, and that makes it difficult to envision how
> the feature should work at the end of the implementation plan.
>
> Add a section that details how a bundle provider could work, including
> using the Git server advertisement for multiple geo-distributed servers.
> This organization is based on the GVFS Cache Servers which have
> successfully used similar ideas to provide fast object access and
> reduced server load for very large repositories.

Thanks! This patch is helpful guidance for bundle server implementors.

> +This example organization is a simplified model of what is used by the
> +GVFS Cache Servers (see section near the end of this document) which have
> +been beneficial in speeding up clones and fetches for very large
> +repositories, although using extra software outside of Git.

Nit: might be a good idea to use "VFS for Git" rather than the old name
"GVFS" [1].

> +The bundle provider deploys servers across multiple geographies. Each
> +server manages its own bundle set. The server can track a number of Git
> +repositories, but provides a bundle list for each based on a pattern. For
> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
> +the bundle server could have its bundle list available at
> +`https://<server-url>/<domain>/<org>/<repo>`.
> +The origin Git server can
> +list all of these servers under the "any" mode:
> +
> +    [bundle]
> +        version = 1
> +        mode = any
> +
> +    [bundle "eastus"]
> +        uri = https://eastus.example.com/<domain>/<org>/<repo>
> +
> +    [bundle "europe"]
> +        uri = https://europe.example.com/<domain>/<org>/<repo>
> +
> +    [bundle "apac"]
> +        uri = https://apac.example.com/<domain>/<org>/<repo>
> +
> +This "list of lists" is static and only changes if a bundle server is
> +added or removed.
> +
> +Each bundle server manages its own set of bundles. The initial bundle list
> +contains only a single bundle, containing all of the objects received from
> +cloning the repository from the origin server. The list uses the
> +`creationToken` heuristic and a `creationToken` is made for the bundle
> +based on the server's timestamp.

Just to confirm, in this example the origin server advertises a single
URL (over v2 protocol) that points to this example "list of lists"?

  Remote -> 1 URL -> List(any/split by geo) -> List(all/split by time)

> +The bundle server runs regularly-scheduled updates for the bundle list,
> +such as once a day. During this task, the server fetches the latest
> +contents from the origin server and generates a bundle containing the
> +objects reachable from the latest origin refs, but not contained in a
> +previously-computed bundle. This bundle is added to the list, with care
> +that the `creationToken` is strictly greater than the previous maximum
> +`creationToken`.
> +
> +When the bundle list grows too large, say more than 30 bundles, then the
> +oldest "_N_ minus 30" bundles are combined into a single bundle. This
> +bundle's `creationToken` is equal to the maximum `creationToken` among the
> +merged bundles.
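
A minimal sketch of the merge step quoted above, in case it helps
implementors check their reading. It takes the text literally: with N
bundles in the list, the oldest N - 30 are collapsed into one entry whose
`creationToken` is the maximum token among the merged ones, which leaves
31 entries. `Bundle`, `compact`, and the `keep` parameter are hypothetical
names for illustration only; none of this is Git API, and the real object
data merge is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    # Hypothetical record for one bundle list entry.
    name: str
    creation_token: int


def compact(bundles, keep=30):
    """Collapse the oldest len(bundles) - keep entries into one entry.

    Assumption: "oldest" means smallest creationToken.
    """
    ordered = sorted(bundles, key=lambda b: b.creation_token)
    n_merge = len(ordered) - keep
    if n_merge < 2:
        # Fewer than two old bundles: nothing to combine.
        return ordered
    old, recent = ordered[:n_merge], ordered[n_merge:]
    # The merged bundle inherits the maximum token among its sources,
    # so it still sorts before every bundle that was kept.
    token = max(b.creation_token for b in old)
    merged = Bundle(name=f"merged-{token}", creation_token=token)
    return [merged] + recent
```

Because the merged entry reuses an existing token, the list stays sorted
and no kept bundle ever has to be renumbered.
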
> +
> +An example bundle list is provided here, although it only has two daily
> +bundles and not a full list of 30:
> +
> +    [bundle]
> +        version = 1
> +        mode = all
> +        heuristic = creationToken
> +
> +    [bundle "2022-02-13-1644770820-daily"]
> +        uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
> +        creationToken = 1644770820
> +
> +    [bundle "2022-02-09-1644442601-daily"]
> +        uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
> +        creationToken = 1644442601
> +
> +    [bundle "2022-02-02-1643842562"]
> +        uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
> +        creationToken = 1643842562
> +
> +To avoid storing and serving object data in perpetuity despite becoming
> +unreachable in the origin server, this bundle merge can be more careful.
> +Instead of taking an absolute union of the old bundles, instead the bundle
> +can be created by looking at the newer bundles and ensuring that their
> +necessary commits are all available in this merged bundle (or in another
> +one of the newer bundles). This allows "expiring" object data that is not
> +being used by new commits in this window of time. That data could be
> +reintroduced by a later push.
> +
> +The intention of this data organization has two main goals. First, initial
> +clones of the repository become faster by downloading precomputed object
> +data from a closer source. Second, `git fetch` commands can be faster,
> +especially if the client has not fetched for a few days. However, if a
> +client does not fetch for 30 days, then the bundle list organization would
> +cause redownloading a large amount of object data.
> +
> +One way to make this organization more useful to users who fetch frequently
> +is to have more frequent bundle creation. For example, bundles could be
> +created every hour, and then once a day those "hourly" bundles could be
> +merged into a "daily" bundle.
> +The daily bundles are merged into the
> +oldest bundle after 30 days.
> +
> +It is recommened that this bundle strategy is repeated with the `blob:none`
> +filter if clients of this repository are expecting to use blobless partial
> +clones. This list of blobless bundles stays in the same list as the full
> +bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
> +For very large repositories, the bundle provider may want to _only_ provide
> +blobless bundles.
> +
> Implementation Plan
> -------------------
>

In general this looks good to me!

[1] https://github.com/microsoft/VFSForGit/issues/72