Re: RFC on packfile URIs and .gitmodules check

> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
> 
> > We wouldn't be OK, actually. Suppose we have a separate packfile
> > containing only the ".gitmodules" blob - when we call fsck_finish(), we
> > would not have downloaded the other packfile yet. Git processes the
> > entire fetch response by piping the inline packfile (after demux) into
> > index-pack (which is the one that calls fsck_finish()) before it
> > downloads any of the other packfile(s).
> 
> Is that order documented as a requirement for implementation?
> 
> Naïvely, I would expect that a CDN offload is meant to relieve
> servers from the burden of having to repack the ancient part of the
> history all the time for every new "clone" client, and that is what
> the "here is a URI, go fetch it because I won't give you objects
> that already appear there" feature is about.  Because we expect that
> the offloaded contents would not be up-to-date, the traditional
> packfile transfer would then be used to complete the history with
> objects necessary for the parts of the history newer than the
> offloaded contents.
> 
> And from that viewpoint, it sounds totally backwards to start
> processing the up-to-the-minute fresh packfile that came via the
> traditional packfile transfer before the CDN offloaded contents are
> fetched and stored safely in our repository.
> 
> We probably want to finish interaction with the live server as
> quickly as possible---it would go counter to that wish if we forced
> the live part of the history to hang in flight, unprocessed, while
> the client downloads the offloaded bulk from the CDN and processes
> it, making the server side stuck waiting for some write(2) to go
> through.
> 
> But I still wonder if it is an option to locally delay the
> processing of the up-to-the-minute-fresh part.
> 
> Instead of feeding what comes from them directly to "index-pack
> --fsck-objects", would it make sense to spool it to a temporary
> file, so that we can release the server early, but then make sure
> to fetch and process the packfile URI material before coming back
> to process the spooled pack data?  That would allow the newer part
> of the history to have newer trees that still reference the same
> old .gitmodules found in the frozen packfile that comes from the
> CDN, no?
> 
> Or can there be a situation where some objects in CDN pack are
> referred to by objects in the up-to-the-minute-fresh pack (e.g. a
> ".gitmodules" blob in CDN pack is still unchanged and used in an
> updated tree in the latest revision) and some other objects in CDN
> pack refer to an object in the live part of the history?  If there
> is such a cyclic dependency, "index-pack --fsck" one pack at a time
> would not work, but I doubt such a cycle can arise.

My intention is that the order of the packfiles (and cyclic
dependencies) would not matter, so we wouldn't need to delay any
processing of the up-to-the-minute-fresh part. I'm currently working on
getting index-pack to output a list of the dangling .gitmodules files,
so that fetch-pack (its consumer) can do one final fsck on those files.
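
To illustrate the idea (this is a toy model, not git's actual
internals; index_pack, final_fsck and the object names are made up),
the deferred check amounts to recording any .gitmodules blob that a
tree references but that no pack seen so far contains, and only
failing if something is still unresolved after every pack - inline or
CDN-offloaded - has been indexed, so arrival order and cross-pack
references cannot cause a spurious failure:

```python
def index_pack(objects, store, dangling):
    """Index one pack: add its objects to the store, resolve any
    previously dangling .gitmodules blobs it supplies, and record
    newly referenced .gitmodules blobs that are still missing."""
    for oid, obj in objects.items():
        store[oid] = obj
        dangling.discard(oid)  # a later pack may supply the blob
    for obj in objects.values():
        for ref in obj.get("gitmodules_refs", []):
            if ref not in store:
                dangling.add(ref)

def final_fsck(store, dangling):
    """After all packs are indexed, every referenced .gitmodules
    blob must exist somewhere; otherwise the fetch is rejected."""
    missing = sorted(oid for oid in dangling if oid not in store)
    if missing:
        raise ValueError(f"unresolved .gitmodules blobs: {missing}")

# A tree in the inline pack references a .gitmodules blob ("gm1")
# that only arrives later in the CDN pack - and the check still
# succeeds, regardless of which pack was indexed first.
store, dangling = {}, set()
index_pack({"tree1": {"gitmodules_refs": ["gm1"]}}, store, dangling)
index_pack({"gm1": {}}, store, dangling)
final_fsck(store, dangling)  # no error: nothing left dangling
```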

Another way, as you said, is to say that the order of the packfiles
matters (which potentially allows some simplification on the client
side), but I don't think that we need to lose this flexibility.
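
For readers following along, the offload discussed in this thread is
driven by configuration roughly like the following sketch; the OID,
pack hash, and URI values are placeholders, and the exact shape of
these knobs may change while the feature is still experimental:

```shell
# Server side: declare that a given blob can be obtained from a
# CDN-hosted packfile instead of being sent inline.  The value is
# "<object-hash> <pack-hash> <uri>" (all three are placeholders here).
git config --add uploadpack.blobpackfileuri \
    "<blob-oid> <pack-hash> https://cdn.example.com/big.pack"

# Client side: opt in to following https packfile URIs during fetch.
git -c fetch.uriprotocols=https clone https://example.com/repo.git
```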
