On 6/8/2022 5:01 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@xxxxxxxxxx> writes:
>>> That sounds quite straight-forward. Do you envision that their
>>> incremental snapshot packfile chains can somehow be shared with the
>>> bundle URI implementations? Doesn't it make it more cumbersome that
>>> this proposal uses the bundles as the encapsulation format, rather
>>> than packfiles? As you are sending extra pieces of information on
>>> top of the payload in the form of table-of-contents already, I
>>> wonder if bundle.<id>.uri should point at a bare packfile (instead
>>> of a bundle), while multi-valued bundle.<id>.prerequisite give the
>>> prerequisite objects? The machinery that is already generating the
>>> prefetch packfiles already know which packfile has what
>>> prerequisites in it, so it rather looks simpler if the solution did
>>> not involve bundles.
>>
>> The prefetch packfiles could be replaced with bundle URIs, if desired.
>> ...
>> So in this world, the bundle URIs could be used as a replacement for
>> downloading these prefetch packfiles (bundles with filter=blob:none)
>> but the bundled refs become useless to the client.
>
> That's all understandable, but what I was alluding to was to go in
> the other direction. Since "bundle URI" thing is new, while the
> GVFS Cache Servers already use these prefetch packfiles, it could be
> beneficial if the new thing can be done without bundle files and
> instead with packfiles. You are already generating these snapshot
> packfiles for GVFS Cache Servers. So if we can reuse them to also
> serve "git clone" and "git fetch" clients, we can do so without
> doubling the disk footprint.

Now I'm confused as to what you are trying to say, so let me back up
and start from the beginning. Hopefully, that brings clarity so we can
get to the root of my confusion.

The GVFS Cache Servers started as a way to have low-latency per-object
downloads to satisfy the filesystem virtualization feature of the
clients. This was initially going to be the _only_ way clients got
objects, until we realized that commit and tree "misses" are very
expensive.

So, the "prefetch packfile" system was developed to use timestamp-based
packs that contain commits and trees. Clients provide their latest
timestamp and the servers provide the list of packfiles to download.
Because the GVFS Protocol still has the "download objects on-demand"
feature, any objects that were needed but were not already in those
prefetch packfiles (including recently-pushed commits and trees) could
be downloaded by the clients on demand.

This has been successful in production, and in particular it is
helpful that cache servers can be maintained completely independently
of the origin Git server. There is some configuration to allow the
origin server to advertise the list of cache servers via the
<url>/gvfs/config REST API, but otherwise they are completely
independent.

For years, I've been interested in bringing this kind of functionality
to Git proper, but struggled on multiple fronts:

 1. The independence of the cache servers meant they could not be used
    with packfile-URIs.

 2. The way packfile-URIs happen _within_ a fetch negotiation (sketched
    roughly below) makes it hard to integrate even if we didn't have
    this independence.

 3. If the Git client directly downloaded these packfiles from the
    cache server, then how does it get the remaining objects from the
    origin server?
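To illustrate points 1 and 2, here is very roughly where packfile-URIs
live in the protocol today (pkt-line framing and most arguments are
omitted, so treat the exact lines as approximate, not authoritative):

    C: command=fetch
    C: packfile-uris https          # client: "I can download https URIs"
    C: want <oid> / have <oid> ...  # normal negotiation
    S: packfile-uris                # URI list arrives *inside* the response
    S: <hash> https://cdn.example.com/base-<hash>.pack
    S: packfile
    S: <pack data for whatever the listed URIs do not cover>

The URI list only ever arrives from the origin server, in the middle of
an already-running fetch, which is exactly what makes an independent
cache server hard to fit into that mechanism.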
Ævar's observation that bundles also add ref tips to the packfile is
the key to breaking down that last concern: these ref tips give us a
way to negotiate the difference between what the client already has
(including the bundles downloaded from a bundle provider) and what it
wants from the origin Git server. This all happens without any change
necessary to the origin Git server.

And thus, this bundle URI design came about. It takes all of the best
things about the GVFS Cache Server but then layers refs on top of the
time-based prefetch packfiles so a normal Git client can do that
"catch-up fetch" afterwards [1].

This motivated my "could we use the new bundle URI feature in the old
GVFS Cache Server environment?" comment: I could imagine updating GVFS
Cache Servers to generate bundles instead (or also) and updating the
VFS for Git clients to use the bundle URI feature to download the data.
However, for the sake of not overloading the origin server with those
incremental fetches, we would probably keep the "only download missing
objects on-demand" feature in that environment. (Hence, the refs are
useless to those clients.)

However, you seem to be hinting at "the GVFS Cache Servers seem to work
just fine, so why do we need bundles?" but I think that the constraints
of what is expected at the end of "git clone" or "git fetch" require us
to not "catch up later" and instead complete the full download during
the process. The refs in the bundles are critical to making that work.

> Even if you scrapped the "bundle URI" and rebuilt it as the
> "packfile URI" mechanism, the only change you need is to make
> positive and negative refs, which were available in bundle files but
> not stored in packfiles, available as a part of the metadata for
> each packfile, no? You'd be keeping track of associated metadata
> (like the .timestamp and .requires fields) in addition to what is in
> the bundle anyway, so...

From this comment, it seems you are suggesting that we augment the
packfile data being served by the packfile-URI feature in order to
include those positive/negative refs (as well as the other metadata
that is included in the bundle URI design). I see two major issues
with that:

 1. We don't have a way to add that metadata directly into packfiles,
    so we would need to update the file format or update the
    packfile-URI protocol to include that metadata.

 2. The only source of packfile-URI listings comes as a response to
    the "git fetch" request to the origin Git server, so there is no
    way to allow an independent server to provide that data.

I hope I am going in the right direction here, but I likely
misunderstood some of your proposed alternatives.

Thanks,
-Stolee
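[1] A rough sketch of what I mean by the "catch-up fetch", assuming the
    bundle provider's data has already been downloaded; the
    refs/bundles/* namespace and the exact commands are illustrative,
    not the final design:

        # A bundle can be used directly as a fetch source, so import
        # its objects and keep its ref tips somewhere local:
        git fetch --no-tags ./downloaded.bundle "refs/heads/*:refs/bundles/*"

        # Then fetch from the origin server as usual. The bundle tips
        # now act as negotiation tips, so the server only sends the
        # objects the bundles did not already contain:
        git fetch origin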