Re: [PATCH v3 2/2] bundle-uri: add example bundle organization

On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@xxxxxxxxxx>
> 
> The previous change introduced the bundle URI design document. It
> creates a flexible set of options that allow bundle providers many ways
> to organize Git object data and speed up clones and fetches. It is
> particularly important that we have flexibility so we can apply future
> advancements as new ideas for efficiently organizing Git data are
> discovered.
> 
> However, the design document does not provide even an example of how
> bundles could be organized, and that makes it difficult to envision how
> the feature should work at the end of the implementation plan.
> 
> Add a section that details how a bundle provider could work, including
> using the Git server advertisement for multiple geo-distributed servers.
> This organization is based on the GVFS Cache Servers which have
> successfully used similar ideas to provide fast object access and
> reduced server load for very large repositories.

Thanks! This patch is helpful guidance for bundle server implementors.

> +This example organization is a simplified model of what is used by the
> +GVFS Cache Servers (see section near the end of this document) which have
> +been beneficial in speeding up clones and fetches for very large
> +repositories, although using extra software outside of Git.

Nit: might be a good idea to use "VFS for Git" rather than the old name
"GVFS" [1].

> +The bundle provider deploys servers across multiple geographies. Each
> +server manages its own bundle set. The server can track a number of Git
> +repositories, but provides a bundle list for each based on a pattern. For
> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
> +the bundle server could have its bundle list available at
> +`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
> +list all of these servers under the "any" mode:
> +
> +	[bundle]
> +		version = 1
> +		mode = any
> +		
> +	[bundle "eastus"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "europe"]
> +		uri = https://europe.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "apac"]
> +		uri = https://apac.example.com/<domain>/<org>/<repo>
> +
> +This "list of lists" is static and only changes if a bundle server is
> +added or removed.
> +
> +Each bundle server manages its own set of bundles. The initial bundle list
> +contains only a single bundle, containing all of the objects received from
> +cloning the repository from the origin server. The list uses the
> +`creationToken` heuristic and a `creationToken` is made for the bundle
> +based on the server's timestamp.

Just to confirm, in this example the origin server advertises a single
URL (over v2 protocol) that points to this example "list of lists"?

Remote -> 1 URL -> List(any/split by geo) -> List(all/split by time)
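
To make the URL pattern from the patch concrete, here is a minimal sketch in Python. It is not part of the patch: the function names, the server map, and the repository URL `https://git.example.org/my-org/my-repo` are all hypothetical, and the sketch assumes `<domain>` is simply the host of the origin URL.

```python
from urllib.parse import urlsplit


def bundle_list_url(server_url: str, repo_url: str) -> str:
    """Map an origin repo URL to its bundle list URL on one bundle server.

    Assumption: <domain> is the origin host and the URL path already
    holds <org>/<repo>.
    """
    parts = urlsplit(repo_url)
    return f"{server_url.rstrip('/')}/{parts.netloc}/{parts.path.strip('/')}"


def any_mode_list(servers: dict[str, str], repo_url: str) -> str:
    """Render the static "list of lists" the origin server could advertise."""
    lines = ["[bundle]", "\tversion = 1", "\tmode = any"]
    for name, server_url in servers.items():
        lines.append("")
        lines.append(f'[bundle "{name}"]')
        lines.append(f"\turi = {bundle_list_url(server_url, repo_url)}")
    return "\n".join(lines) + "\n"


# Hypothetical servers, named after the regions used in the patch.
servers = {
    "eastus": "https://eastus.example.com",
    "europe": "https://europe.example.com",
}
print(any_mode_list(servers, "https://git.example.org/my-org/my-repo"))
```

Each `uri` in the output points at a per-server bundle list rather than at a bundle file, which matches the chain sketched above.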

> +The bundle server runs regularly-scheduled updates for the bundle list,
> +such as once a day. During this task, the server fetches the latest
> +contents from the origin server and generates a bundle containing the
> +objects reachable from the latest origin refs, but not contained in a
> +previously-computed bundle. This bundle is added to the list, with care
> +that the `creationToken` is strictly greater than the previous maximum
> +`creationToken`.
> +
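
The "strictly greater" requirement is easy to get wrong if two updates land in the same second or the server clock steps backwards. A minimal sketch of one way to guard against that, assuming tokens are integer Unix timestamps as in the example list (the function name is hypothetical):

```python
import time
from typing import Optional


def next_creation_token(previous_max: int, now: Optional[int] = None) -> int:
    """Return a token strictly greater than every token already in the list.

    Uses the server timestamp when it is large enough, and otherwise
    falls back to previous_max + 1.
    """
    if now is None:
        now = int(time.time())
    return max(now, previous_max + 1)
```

With this guard the tokens stay ordered even when they no longer match wall-clock time exactly. The heuristic described in the patch only depends on the ordering.
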
> +When the bundle list grows too large, say more than 30 bundles, then the
> +oldest "_N_ minus 30" bundles are combined into a single bundle. This
> +bundle's `creationToken` is equal to the maximum `creationToken` among the
> +merged bundles.
> +
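
A small sketch of the list bookkeeping for this merge step, reading the paragraph literally: keep the 30 newest bundles and fold everything older into one entry. It models only the metadata, not the creation of the merged bundle file, and the `Bundle` type and `compact` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    name: str
    creation_token: int


def compact(bundles: list[Bundle], keep: int = 30,
            merged_name: str = "merged") -> list[Bundle]:
    """Fold the oldest N - keep bundles into a single list entry."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(bundles, key=lambda b: b.creation_token)
    if len(ordered) <= keep:
        return ordered
    oldest, newest = ordered[:-keep], ordered[-keep:]
    # The merged entry takes the maximum creationToken among its parts.
    merged = Bundle(merged_name, max(b.creation_token for b in oldest))
    return [merged] + newest
```

Under this reading the list settles at 31 entries (one merged bundle plus 30 newer ones), which may or may not be what the paragraph intends.
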
> +An example bundle list is provided here, although it only has two daily
> +bundles and not a full list of 30:
> +
> +	[bundle]
> +		version = 1
> +		mode = all
> +		heuristic = creationToken
> +
> +	[bundle "2022-02-13-1644770820-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle
> +		creationToken = 1644770820
> +
> +	[bundle "2022-02-09-1644442601-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
> +		creationToken = 1644442601
> +
> +	[bundle "2022-02-02-1643842562"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
> +		creationToken = 1643842562
> +
> +To avoid storing and serving object data in perpetuity despite becoming
> +unreachable in the origin server, this bundle merge can be more careful.
> +Instead of taking an absolute union of the old bundles, the bundle
> +can be created by looking at the newer bundles and ensuring that their
> +necessary commits are all available in this merged bundle (or in another
> +one of the newer bundles). This allows "expiring" object data that is not
> +being used by new commits in this window of time. That data could be
> +reintroduced by a later push.
> +
> +This data organization has two main goals. First, initial
> +clones of the repository become faster by downloading precomputed object
> +data from a closer source. Second, `git fetch` commands can be faster,
> +especially if the client has not fetched for a few days. However, if a
> +client does not fetch for 30 days, then the bundle list organization would
> +cause redownloading a large amount of object data.
> +
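
To illustrate the 30-day caveat, here is a simplified client-side model. It is not Git's actual algorithm: it assumes the client remembers the largest `creationToken` it has seen and treats every bundle with a larger token as needed. The function name is hypothetical; the tokens are the ones from the example list.

```python
def bundles_to_download(bundle_list, last_seen_token):
    """Return (name, creationToken) pairs newer than last_seen_token, oldest first."""
    newer = [(name, token) for name, token in bundle_list
             if token > last_seen_token]
    return sorted(newer, key=lambda item: item[1])


example = [
    ("2022-02-13-1644770820-daily", 1644770820),
    ("2022-02-09-1644442601-daily", 1644442601),
    ("2022-02-02-1643842562", 1643842562),
]

# A client that fetched after the 2022-02-09 bundle needs only the newest daily.
print(bundles_to_download(example, 1644442601))

# A client whose last token predates the merged bundle needs all three,
# including the large merged one.
print(bundles_to_download(example, 1643000000))
```

Because the merged bundle's token keeps rising as dailies are folded into it, a client that has been away longer than the window falls into the second case.
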
> +One way to make this organization more useful to users who fetch frequently
> +is to have more frequent bundle creation. For example, bundles could be
> +created every hour, and then once a day those "hourly" bundles could be
> +merged into a "daily" bundle. The daily bundles are merged into the
> +oldest bundle after 30 days.
> +
> +It is recommended that this bundle strategy be repeated with the `blob:none`
> +filter if clients of this repository are expecting to use blobless partial
> +clones. This list of blobless bundles stays in the same list as the full
> +bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
> +For very large repositories, the bundle provider may want to _only_ provide
> +blobless bundles.
> +
>  Implementation Plan
>  -------------------
>  
In general this looks good to me!

[1] https://github.com/microsoft/VFSForGit/issues/72


