Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 29 Oct 2021 14:46:19 -0400

On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> This implements a new "bundle-uri" protocol v2 extension, which allows
> servers to advertise *.bundle files which clients can pre-seed their
> full "clone"'s or incremental "fetch"'s from.
> 
> This is both an alternative to, and complimentary to the existing
> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
> both, but would generally pick one over the other.
> 
> This "bundle-uri" mechanism has the advantage of being dumber, and
> offloads more complexity from the server side to the client
> side.

Generally, I like that using bundles presents an easier way to serve
static content from an alternative source and then let Git's fetch
negotiation catch up with the remainder.

However, after inspecting your design and talking to some GitHub
engineers who know more about CDNs and general internet things than I
do, I want to propose an alternative design. I think this new design
is more flexible and further decouples the origin Git server from the
bundle contents.

Your proposed design extends protocol v2 to let the client request a
list of bundle URIs from the origin server. However, this still requires
the origin server to know about this list. Further, your implementation
focuses on the server side without integrating with the client.

I propose that we flip this around. The "bundle server" should know
which bundles are available at which URIs, and the client should contact
the bundle server directly for a "table of contents" that lists these
URIs, along with metadata related to each URI. The origin Git server
would then only need to store the list of bundle servers and the URIs
of their tables of contents. The client could then pick from among those
bundle servers (probably by ping time, or randomly) to start the bundle
downloads.
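
As a rough illustration of that client-side choice, here is a minimal
sketch. Nothing in it comes from an existing implementation: the
function name, the idea of passing in pre-measured latencies, and the
fallback order are all assumptions made for the example.

```python
import random


def pick_bundle_server(toc_uris, latencies=None, rng=random):
    """Pick one table-of-contents URI from those the origin advertises.

    toc_uris:  list of table-of-contents URIs (hypothetical values).
    latencies: optional dict mapping URI -> measured round trip in
               seconds; how these are measured is left open here.
    """
    if not toc_uris:
        # No bundle servers known: caller falls back to a plain fetch.
        return None
    if latencies:
        measured = [u for u in toc_uris if u in latencies]
        if measured:
            # Prefer the server with the lowest measured latency.
            return min(measured, key=lambda u: latencies[u])
    # No usable measurements: pick one at random.
    return rng.choice(toc_uris)
```

Returning None when the list is empty keeps bundle servers strictly
optional: a client that knows of none just talks to the origin as usual.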

To summarize, there are two pieces here that can be implemented at
different times:

1. Create a specification for a "bundle server" that doesn't need to
   speak the Git protocol at all. This could be a REST API specification
   using well-established standards such as JSON for the table of
   contents.

2. Create a way for the origin Git server to advertise known bundle
   servers to clients so they can automatically benefit from faster
   downloads without needing to know about bundle servers.
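
To give a feel for the first piece, here is a sketch of what a JSON
table of contents and a client-side reader might look like. The format
is not specified anywhere yet, so the "bundles", "uri" and "timestamp"
keys, the example.com URIs, and the function name are all hypothetical.

```python
import json

# Hypothetical table of contents; the real schema is still undecided.
EXAMPLE_TOC = """
{
  "bundles": [
    {"uri": "https://bundles.example.com/repo/incr-1.bundle", "timestamp": 200},
    {"uri": "https://bundles.example.com/repo/base.bundle", "timestamp": 100},
    {"uri": "https://bundles.example.com/repo/incr-2.bundle", "timestamp": 300}
  ]
}
"""


def bundles_to_download(toc_text, newer_than=0):
    """Return bundle URIs newer than 'newer_than', oldest first.

    Oldest-first order is an assumption: it lets each incremental
    bundle build on the objects delivered by the ones before it.
    """
    toc = json.loads(toc_text)
    entries = [b for b in toc["bundles"] if b["timestamp"] > newer_than]
    entries.sort(key=lambda b: b["timestamp"])
    return [b["uri"] for b in entries]
```

A fresh clone would call bundles_to_download(EXAMPLE_TOC) and get all
three URIs; a later fetch could pass the timestamp of the last bundle
it already has and receive only the newer ones.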

There are a few key benefits to this approach:

 * Further decoupling. The origin Git server doesn't need to know how
   the bundle server organizes its bundles. This allows maximum flexibility
   depending on whether the bundles are stored in something like a CDN
   (where bundles can't be too big) or some kind of blob storage (where
   they can have arbitrarily large size).

 * The bundle servers could be run completely independently from the
   origin Git server. Organizations could run their own bundle servers to
   host data in the same building as their build farms. As long as they
   can configure the bundle location at clone/fetch time, the origin Git
   server doesn't need to be involved.

While I didn't go so far as to create a clear standard or implement a
prototype in the Git codebase, I created a very simple prototype [1] using
a Python script that parses a JSON table of contents and downloads
bundles into the Git repository. Then, I made a 'clone.sh' script that
initializes a repository using the bundle fetcher and then fetches the
remainder from the origin Git server. I even computed static bundles for
the git.git repository based on where 'master' has been over several days
in the past month, to give an example of incremental bundles. You can
test the approach end to end, including the final fetch from github.com
(note that the GitHub servers were not modified in any way for this).

[1] https://github.com/derrickstolee/bundles
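
The overall flow can be sketched as a list of ordinary Git commands.
This is not the code from the prototype, only an illustration of the
idea: the function name is made up, and the refs/bundles/* namespace
used to hold the bundle refs is an assumption.

```python
def clone_commands(repo_dir, bundle_files, origin_url):
    """Build the Git commands for a bundle-seeded clone.

    bundle_files: already-downloaded bundle paths, oldest first. Use
    absolute paths, since 'git -C' changes directory before running.
    """
    cmds = [["git", "init", repo_dir]]
    for path in bundle_files:
        # A bundle file can be fetched from like any other remote.
        # Bundle refs go to a side namespace (hypothetical choice) so
        # they do not clash with the branches fetched from the origin.
        cmds.append(["git", "-C", repo_dir, "fetch", path,
                     "+refs/heads/*:refs/bundles/*"])
    # The origin only has to supply what the bundles did not cover.
    cmds.append(["git", "-C", repo_dir, "remote", "add", "origin",
                 origin_url])
    cmds.append(["git", "-C", repo_dir, "fetch", "origin"])
    return cmds
```

Each entry can be handed to subprocess.run(cmd, check=True) in order.
Because the bundle objects are already in the repository when the last
command runs, the normal fetch negotiation with the origin only needs
to transfer the remainder.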

There are a lot of limitations to the prototype, but it hopefully
demonstrates the possibility of using something other than the Git protocol
to solve these problems.

Let me know if you are interested in switching your approach to something
more like what I propose here. There are many more questions about what
information could/should be located in the table of contents and how it can
be extended in the future. I'm interested to explore that space with you.

Thanks,
-Stolee