On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> This implements a new "bundle-uri" protocol v2 extension, which allows
> servers to advertise *.bundle files which clients can pre-seed their
> full "clone"'s or incremental "fetch"'s from.
>
> This is both an alternative to, and complimentary to the existing
> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
> both, but would generally pick one over the other.
>
> This "bundle-uri" mechanism has the advantage of being dumber, and
> offloads more complexity from the server side to the client
> side.

Generally, I like that using bundles presents an easier way to serve
static content from an alternative source and then let Git's fetch
negotiation catch up with the remainder.

However, after inspecting your design and talking to some GitHub
engineers who know more about CDNs and general internet things than I
do, I want to propose an alternative design. I think this new design is
both more flexible and further decouples the origin Git server from the
bundle contents.

Your proposed design extends protocol v2 to let the client request a
list of bundle URIs from the origin server. However, this still requires
the origin server to know about this list. Further, your implementation
focuses on the server side without integrating with the client.

I propose that we flip this around. The "bundle server" should know
which bundles are available at which URIs, and the client should contact
the bundle server directly for a "table of contents" that lists these
URIs, along with metadata related to each URI. The origin Git server
would then only need to store the list of bundle servers and the URIs of
their tables of contents. The client could then pick from among those
bundle servers (probably by ping time, or randomly) to start the bundle
downloads.

To summarize, there are two pieces here that can be implemented at
different times:

1. Create a specification for a "bundle server" that doesn't need to
   speak the Git protocol at all. This could be a REST API specification
   using well-established standards such as JSON for the table of
   contents.

2. Create a way for the origin Git server to advertise known bundle
   servers to clients so they can automatically benefit from faster
   downloads without needing to know about bundle servers.

There are a few key benefits to this approach:

* Further decoupling. The origin Git server doesn't need to know how the
  bundle server organizes its bundles. This allows maximum flexibility
  depending on whether the bundles are stored in something like a CDN
  (where bundles can't be too big) or some kind of blob storage (where
  they can have arbitrarily large size).

* The bundle servers could be run completely independently from the
  origin Git server. Organizations could run their own bundle servers to
  host data in the same building as their build farms. As long as they
  can configure the bundle location at clone/fetch time, the origin Git
  server doesn't need to be involved.

While I didn't go so far as to create a clear standard or implement a
prototype in the Git codebase, I created a very simple prototype [1]
using a Python script that parses a JSON table of contents and downloads
bundles into the Git repository. Then, I made a 'clone.sh' script that
initializes a repository using the bundle fetcher and then fetches the
remainder from the origin Git server.

I even computed static bundles for the git.git repository based on where
'master' has been over several days in the past month, to give an
example of incremental bundles. You can test the approach all the way
through the final fetch from github.com (note how the GitHub servers
were not modified in any way for this).

[1] https://github.com/derrickstolee/bundles

There are a lot of limitations to the prototype, but it hopefully
demonstrates the possibility of using something other than the Git
protocol to solve these problems.

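To make the table-of-contents idea a bit more concrete, here is a
minimal Python sketch of how a client might decide which bundles to
download. The JSON layout ("bundles", "uri", "timestamp"), the
example.com URLs, and the function name are all hypothetical; they are
not the format used by the prototype in [1] or any agreed specification.

```python
import json

# Hypothetical table of contents. The field names and URLs are
# assumptions made for illustration only.
EXAMPLE_TOC = """
{
  "bundles": [
    {"uri": "https://bundles.example.com/base.bundle",
     "timestamp": 1633000000},
    {"uri": "https://bundles.example.com/day-2.bundle",
     "timestamp": 1633200000},
    {"uri": "https://bundles.example.com/day-1.bundle",
     "timestamp": 1633100000}
  ]
}
"""


def bundles_to_fetch(toc_text, have_timestamp=0):
    """Return URIs of bundles newer than have_timestamp, oldest first.

    A fresh clone passes 0 and gets every bundle; an incremental fetch
    passes the timestamp of the newest bundle it already applied.
    """
    toc = json.loads(toc_text)
    newer = [b for b in toc["bundles"] if b["timestamp"] > have_timestamp]
    # Oldest first, so each incremental bundle finds its prerequisites.
    newer.sort(key=lambda b: b["timestamp"])
    return [b["uri"] for b in newer]


if __name__ == "__main__":
    for uri in bundles_to_fetch(EXAMPLE_TOC):
        print(uri)
```

Each returned URI could then be downloaded and fetched from with
'git fetch <file>', after which a normal fetch from the origin Git
server picks up the remainder.
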
Let me know if you are interested in switching your approach to
something more like what I propose here. There are many more questions
about what information could/should be located in the table of contents
and how it can be extended in the future. I'm interested in exploring
that space with you.

Thanks,
-Stolee