On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

[Note: The E-Mail address you CC'd for me (presumably dropped in this
reply) is not my E-Mail address; this one is]

[Also CC-ing some people who have expressed interest in this area, and
would probably like to be kept in the loop going forward]

> There have been several suggestions to improve Git clone speeds and
> reliability by supplementing the Git protocol with static content. The
> Packfile URI [0] feature lets the Git response include URIs that point
> to packfiles that the client must download to complete the request.
>
> Last year, Ævar suggested using bundles instead of packfiles [1] [2].
> This design has the same benefits as the packfile URI feature because
> it offloads most object downloads to static content fetches. The main
> advantage over packfile URIs is that the remote Git server does not
> need to know what is in those bundles. The Git client tells the server
> what it downloaded during the fetch negotiation afterwards. This
> includes the case where the client did not have access to those
> bundles or otherwise failed to access them. I agreed that this was a
> much more desirable way to serve static content, but had concerns
> about the flexibility of that design [3]. I have not heard more on the
> topic since October, so I started investigating this idea myself in
> December, resulting in this RFC.

This timing is both quite fortunate & unfortunate for me, since I'd
been blocked / waiting on various things until very recently to submit
a non-RFC re-roll of (a larger version of) that series you mentioned
from October.

I guess the good news is that we'll have at least one guaranteed very
interested reviewer for each other's patches, and that the design that
makes it into git.git in the end will definitely be well hashed out :)

I won't be able to review this in any detail right at this hour, but
will be doing so. I'd also like to submit what I've got in some form
soon, so we can hash the two designs out against each other.

That will be some 50+ patches on the ML in total related to this
topic, though, so I think the two of us coming up with some way to
manage all of that for both ourselves & others would be nice. Perhaps
we could also have an off-list (video) chat in real time to
clarify/discuss various things related to this.

Having said that, basically:

> I focused on maximizing flexibility for the service that organizes and
> serves bundles. This includes:
>
> * Bundle URIs work for full and partial clones.
>
> * Bundle URIs can assist with git fetch in addition to git clone.
>
> * Users can set up bundle servers independent of the remote Git server
>   if they specify the bundle URI via a --bundle-uri argument.
>
> This series is based on the recently-submitted series that adds object
> filters to bundles [4]. There is a slight adjacent-line-add conflict
> with js/apply-partial-clone-filters-recursively, but that is in the
> last few patches, so it will be easy to rebase by the time we have a
> fully-reviewable patch series for those steps.
>
> The general breakdown is as follows:
>
> * Patch 1 adds documentation for the feature in its entirety.
>
> * Patches 2-14 add the ability to run 'git clone --bundle-uri='
>
> * Patches 15-17 add bundle fetches to 'git fetch' calls
>
> * Patches 18-25 add a new 'features' capability that allows a server
>   to advertise bundle URIs (and in the future, other features).
>
> I consider the patches in their current form to be "RFC quality".
> There are multiple places where tests are missing or special cases are
> not checked. The goal for this RFC is to seek feedback on the
> high-level ideas before committing to the deep work of creating
> mergeable patches.
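(For anyone following along who hasn't read the series yet: as I
understand the breakdown above, the user-facing entry point is the new
--bundle-uri option, i.e. something along the lines of the below. The
URLs here are made up by me:)

    # Bootstrap most objects from static bundle content, then fill in
    # the rest by fetching from the real remote as usual:
    git clone --bundle-uri=https://bundles.example.com/git.bundle \
        https://example.com/git/git.git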
Having skimmed through all of this, a *very rough* overview of what
you've got here & the direction I chose to go in is:

1. I didn't go for an initial step of teaching "git bundle" any direct
   remote operation, rather it's straight to the protocol v2 bits etc.

   I don't think there's anything wrong with that, but I didn't see
   much point in teaching "git bundle" to do that when the eventual
   state is to have "git fetch" do so anyway. In either case the two
   are close to equivalent: either the "fetch" parts are a thin wrapper
   for "git bundle fetch", or a "git bundle fetch/unbundle" is a thin
   equivalent to "init" + "fetch" (with bundle-uri) + "unbundle".

2. By far the main difference is that you're heavily leaning on a TOC
   format which encodes certain assumptions that aren't true of
   clones/fetches in general (but probably are for most fetches),
   whereas my design (as we previously discussed) leans entirely on the
   client making sense of the bundle header & content itself.

   E.g. you have a "bundle.tableOfContents.forFetch", but if you've got
   a git.git clone of "master" and want to:

       git fetch origin refs/heads/todo:todo

   then the assumption that we can cleanly separate "clone" from
   "fetch" breaks down. I.e. such a thing needs to assume that "clone"
   implies "you have most of the objects you need already" and that
   "fetch" means "...an incremental update thereof", doesn't it?

   Whereas I think (but we'll hash that out) that having a client fetch
   the bundle header and work that out via current reachability checks
   will be just as fast or faster, and such a thing is definitely more
   general, i.e. applicable to all sorts/types of fetches.

   (A TOC mechanism might still be good/valuable, but I hope it can be
   a cheap/discardable way to simply cache those bundle headers, or
   serve them up all at once.)

3. Ditto "bundle.<id>.timestamp" in the design (which presumably
   assumes not-rewound histories), and "requires" (which can also
   currently be inferred from bundle headers).

4. I still need to go over your just-submitted "bundle filters" series
   (https://lore.kernel.org/git/pull.1159.git.1645638911.gitgitgadget@xxxxxxxxx/)
   in detail, but by adding a @filter to the file format (good!)
   presumably the "bundle.<id>.filter" amounts to a cache of the
   headers (which was 100% in line with any design I had for such
   extra information associated with a bundle).
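To check my own reading of 2.-4. above: IIUC the TOC is a config-format
file along the lines of the below. The key names other than the ones I
quoted above, and all of the values, are my guesses, not something
copied from your series:

    [bundle "tableOfContents"]
        forFetch = true
    [bundle "2022-02-20-daily"]
        # My guess at the shape; "uri" stands in for however the TOC
        # actually points at the bundle itself:
        uri = https://<host>/2022-02-20-daily.bundle
        timestamp = 1645315200
        requires = 2022-02-13-weekly
        filter = blob:none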
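Whereas the information my design needs is already in (or, with your
@filter change, would be in) the bundle header itself. E.g. the header
of a v3 bundle generated with your filter series might start with
something like this (the OIDs & commit subject are made up):

    # v3 git bundle
    @object-format=sha1
    @filter=blob:none
    -2c5d5336b33e8e6ea89de13b9bcd286087e04428 some boundary commit
    44e1d96ef019612f3bc39ffbbecd5a8a0dd758c7 refs/heads/master

The "-" prerequisite line(s) are what "requires" can be inferred from,
"@filter" covers "bundle.<id>.filter", and comparing the ref tips
against the "ls-refs" advertisement is the reachability check I mean
above.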
In (partial) summary: I really want to lean more heavily into the
distributed nature of git, in that a "bundle clone" should be no more
special than the same operation performed locally, where "clone/fetch"
is pointed at a directory containing X number of local bundles and has
to make sense of whether those help with the clone/fetch operation,
i.e. by parsing their headers & comparing that to the ref
advertisement.

Maybe a meta-format TOC will be needed eventually, and I'm not against
such a thing. I'd just like to make sure we wouldn't be adding it as a
premature optimization, or as something that would needlessly
complicate the design.

In particular (quoting from a part of 01/25):

    +A further optimization is that the client can avoid downloading any
    +bundles if their timestamps are not larger than the stored timestamp.
    +After fetching new bundles, this local timestamp value is updated.

Such caching seems sensible, but to me seems basically redundant to
what you'd get by doing the same with just:

 * A set of dumb bundle files in a directory on a webserver.

 * Unique names for each of those (e.g. naming them
   https://<host>/<hash-of-content>.bundle instead of
   https://<host>/weekly.bundle).

 * Since the content then wouldn't change (HTTP headers indicating
   caching forever), a client would already have downloaded say the
   last 6 of your set of 7 "daily" rotating bundles, and we'd locally
   cache their entire header, not just a timestamp.

I.e. I think you'd get the same reduction in requests (and more) from
that. To go back to the earlier example of:

    git fetch origin refs/heads/todo:todo

You'd get the tip for "todo" from "ls-refs", locally discover that one
of the 6 "daily" bundles whose headers (but not necessarily content)
you'd already downloaded had that advertised OID, and grab it from
there.

The critical difference is that such an arrangement would not assume
linear/additive-only (i.e. fast-forward-only) history, which the
"forFetch" + "timestamp" combination surely does.

And I think we'll be much better off, both in the short and long term,
by leaning heavily into HTTP caching features and things like request
pipelining + range requests rather than a custom meta-index format.

IOW, is a TOC format needed at all if we assume for a moment, for the
sake of argument, that for a given repository the say 100 bundles you'd
potentially serve up aren't remote at all, but something you've got
mmap()'d and whose bundle headers you can inspect and compare with the
remote "ls-refs"?

Because if that's the case we could basically get to the same place via
HTTP caching features, and doing it that way has the advantage of
piggy-backing on all existing caching infrastructure.

Have 1000 computers on your network that keep fetching torvalds/linux?
Stick a proxy configured to cache the first say 1MB of
<bundle-base-url> in front of them. Now all their requests to discover
whether the bundles help will be local (and it would probably make
sense to cache the actual content too).

Whereas any type of custom caching strategy would be per-git-client.

Just food for thought, and sorry that this E-Mail/braindump got so long
already...
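P.S.: To make the "cache the first say 1MB" idea concrete, the
header-only part of this is just a plain HTTP range request, e.g. (URL
scheme as in the bullet points above; the 64k is an arbitrary number I
picked, a real client would read until the blank line that terminates
the bundle header):

    # Grab only the start of the bundle and print the header, i.e.
    # everything up to the first empty line:
    curl -sf --range 0-65535 "https://<host>/<hash-of-content>.bundle" |
        sed -n '/^$/q;p'

An ordinary HTTP proxy can cache that for all the clients behind it,
which is the piggy-backing on existing infrastructure I mean above.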