On 6/8/2022 5:01 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@xxxxxxxxxx> writes:
>>> That sounds quite straight-forward. Do you envision that their
>>> incremental snapshot packfile chains can somehow be shared with the
>>> bundle URI implementations? Doesn't it make it more cumbersome that
>>> this proposal uses the bundles as the encapsulation format, rather
>>> than packfiles? As you are sending extra pieces of information on
>>> top of the payload in the form of table-of-contents already, I
>>> wonder if bundle.<id>.uri should point at a bare packfile (instead
>>> of a bundle), while multi-valued bundle.<id>.prerequisite give the
>>> prerequisite objects? The machinery that is already generating the
>>> prefetch packfiles already know which packfile has what
>>> prerequisites in it, so it rather looks simpler if the solution did
>>> not involve bundles.
>>
>> The prefetch packfiles could be replaced with bundle URIs, if desired.
>> ...
>> So in this world, the bundle URIs could be used as a replacement for
>> downloading these prefetch packfiles (bundles with filter=blob:none)
>> but the bundled refs become useless to the client.
>
> That's all understandable, but what I was alluding to was to go in
> the other direction. Since "bundle URI" thing is new, while the
> GVFS Cache Servers already use these prefetch packfiles, it could be
> beneficial if the new thing can be done without bundle files and
> instead with packfiles. You are already generating these snapshot
> packfiles for GVFS Cache Servers. So if we can reuse them to also
> serve "git clone" and "git fetch" clients, we can do so without
> doubling the disk footprint.

Now I'm confused as to what you are trying to say, so let me back up
and start from the beginning. Hopefully, that brings clarity so we can
get to the root of my confusion.

The GVFS Cache Servers started as a way to have low-latency per-object
downloads to satisfy the filesystem virtualization feature of the
clients. This was initially going to be the _only_ way clients got
objects, until we realized that commit and tree "misses" are very
expensive.

So, the "prefetch packfile" system was developed to use timestamp-based
packs that contain commits and trees. Clients provide their latest
timestamp and the servers provide the list of packfiles to download.
Because the GVFS Protocol still has the "download objects on-demand"
feature, any objects that were needed but were not already in those
prefetch packfiles (including recently-pushed commits and trees) could
be downloaded by the clients on demand.

This has been successful in production, and in particular it is
helpful that cache servers can be maintained completely independently
of the origin Git server. There is some configuration to allow the
origin server to advertise the list of cache servers via the
<url>/gvfs/config REST API, but otherwise they are completely
independent.

For years, I've been interested in bringing this kind of functionality
to Git proper, but struggled on multiple fronts:

 1. The independence of the cache servers meant they could not be used
    with packfile-URIs.

 2. The way packfile-URIs happen _within_ a fetch negotiation (sketched
    roughly below) makes it hard to integrate even if we didn't have
    this independence.

 3. If the Git client directly downloaded these packfiles from the
    cache server, then how does it get the remaining objects from the
    origin server?
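To illustrate points 1 and 2, here is very roughly where packfile-URIs
live in the protocol today (pkt-line framing and most arguments are
omitted, so treat the exact lines as approximate, not authoritative):

    C: command=fetch
    C: packfile-uris https          # client: "I can download https URIs"
    C: want <oid> / have <oid> ...  # normal negotiation
    S: packfile-uris                # URI list arrives *inside* the response
    S: <hash> https://cdn.example.com/base-<hash>.pack
    S: packfile
    S: <pack data for whatever the listed URIs do not cover>

The URI list only ever arrives from the origin server, in the middle of
an already-running fetch, which is exactly what makes an independent
cache server hard to fit into that mechanism.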
Ævar's observation that bundles also add ref tips to the packfile is
the key to breaking down that last concern: these ref tips give us a
way to negotiate the difference between what the client already has
(including the bundles downloaded from a bundle provider) and what it
wants from the origin Git server. This all happens without any change
necessary to the origin Git server.

And thus, this bundle URI design came about. It takes all of the best
things about the GVFS Cache Server but then layers refs on top of the
time-based prefetch packfiles so a normal Git client can do that
"catch-up fetch" afterwards [1].

This motivated my "could we use the new bundle URI feature in the old
GVFS Cache Server environment?" comment: I could imagine updating GVFS
Cache Servers to generate bundles instead (or also) and updating the
VFS for Git clients to use the bundle URI feature to download the data.
However, for the sake of not overloading the origin server with those
incremental fetches, we would probably keep the "only download missing
objects on-demand" feature in that environment. (Hence, the refs are
useless to those clients.)

However, you seem to be hinting at "the GVFS Cache Servers seem to work
just fine, so why do we need bundles?" but I think that the constraints
of what is expected at the end of "git clone" or "git fetch" require us
to not "catch up later" and instead complete the full download during
the process. The refs in the bundles are critical to making that work.

> Even if you scrapped the "bundle URI" and rebuilt it as the
> "packfile URI" mechanism, the only change you need is to make
> positive and negative refs, which were available in bundle files but
> not stored in packfiles, available as a part of the metadata for
> each packfile, no? You'd be keeping track of associated metadata
> (like the .timestamp and .requires fields) in addition to what is in
> the bundle anyway, so...

From this comment, it seems you are suggesting that we augment the
packfile data being served by the packfile-URI feature in order to
include those positive/negative refs (as well as the other metadata
that is included in the bundle URI design). I see two major issues
with that:

 1. We don't have a way to add that metadata directly into packfiles,
    so we would need to update the file format or update the
    packfile-URI protocol to include that metadata.

 2. The only source of packfile-URI listings comes as a response to
    the "git fetch" request to the origin Git server, so there is no
    way to allow an independent server to provide that data.

I hope I am going in the right direction here, but I likely
misunderstood some of your proposed alternatives.

Thanks,
-Stolee
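[1] A rough sketch of what I mean by the "catch-up fetch", assuming the
    bundle provider's data has already been downloaded; the
    refs/bundles/* namespace and the exact commands are illustrative,
    not the final design:

        # A bundle can be used directly as a fetch source, so import
        # its objects and keep its ref tips somewhere local:
        git fetch --no-tags ./downloaded.bundle "refs/heads/*:refs/bundles/*"

        # Then fetch from the origin server as usual. The bundle tips
        # now act as negotiation tips, so the server only sends the
        # objects the bundles did not already contain:
        git fetch origin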