Re: [PATCH] fetch-pack: do not mix --pack_header and packfile uri

Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> You mentioned --fix-thin (5) and --promisor (8). Why do you think that
> none of these should be given to the "index-pack" that processes the
> packfiles given by URI?

Actually, --fix-thin is probably even worse than that.

As the code processes the in-stream packdata before processing, or
even downloading, the pregenerated URI packfile, the objects
necessary to fix a "thin" in-stream packdata are likely to be
unavailable (it is exactly the same problem as the one that made us
delay the fsckobjects check done by index-pack when a URI packfile
is involved, isn't it?).  Even if the client asks for --thin, the
server side shouldn't produce a thin pack for the in-stream
packdata, no?
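
To make the ordering concrete, here is a minimal sketch of the flow
being described; the helper names are hypothetical stand-ins, not
actual fetch-pack.c functions:

    static void index_in_stream_packdata(void);
    static void download_and_index_packfile_uris(void);

    static void process_fetch_response(void)
    {
            /*
             * The in-stream packdata is indexed first, so a thin
             * in-stream pack whose delta bases exist only in the
             * pregenerated CDN packfile cannot have them fixed in
             * at this point...
             */
            index_in_stream_packdata();

            /*
             * ...because the packfile URIs are downloaded and
             * indexed only after that.
             */
            download_and_index_packfile_uris();
    }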

> Perhaps it could be argued that these extra
> packfiles don't need --fix-thin (but I would say that I think servers
> should be allowed to serve thin packfiles through URI too), 

I agree that a URI packfile could be thin; after all, the server end
chooses, based on what the client claims to have, which pregenerated
packfile to hand out, so it is perfectly fine to hand out a
pregenerated packfile that is thin if the client asks for a thin
pack and says it has the base objects missing from that packfile.
And because it is (assumed to be) pregenerated, we can make it a
requirement that no URI packfile should depend on objects that are
created later than that (which means it won't depend on in-stream
packdata).

But we cannot process a thin in-stream packdata, if we are to
process it first, right?

> but I think
> that --promisor is necessary (so that a server could, for example,
> offload all trees and commits to a packfile in a CDN, and offload all
> blobs to a separate packfile in a CDN).

Yes, both packfiles conceptually come from the same server, which
promises to always be available to feed us everything we'd need
later, so both packfiles should be marked as having come from the
same promisor.  So --promisor is one example of an option that
happens to be sharable between the two.

But I do not see it as an indication that the two packs inherently
must be processed with the same options.

> Looking at this list, I think that all the arguments (except 9, which
> has been fixed) are necessary (or at least useful) for indexing a
> packfile given by URI.

I have to say that this focuses too much on the current need, by
going through how the current code happens to handle the two packs.
Of course, if we start from the "the two must be the same" viewpoint
and guard out the bits that require the two to be different with
"if (index_pack_args)", then the resulting code would invoke the two
index-pack processes the same way.

I am more worried about longer-term code health, so "currently
mostly the same" does not make a convincing argument for why the two
must be processed the same way in the future.

>> Also, because this loop copies everything in cmd.args, if our
>> in-stream packdata is small, cmd.args.v[0] would be "unpack-objects",
>> and we end up asking the command to explode the (presumably large
>> enough to be worth pre-generating and serving via CDN) packfile that
>> is given via the packfile URI mechanism.
>
> I specifically guard against this through the "if (do_keep ||
> args->from_promisor || index_pack_args || fsck_objects) {" line (which
> is a complicated line, unfortunately).

I am aware of that line that forbids the in-stream packdata from
getting unpacked into loose objects.  But unless we were told to
keep the resulting pack, or to run fsck-objects via index-pack, I do
not see an inherent reason why the "most recent leftover bits that
are not in the pregenerated pack offloaded to the CDN" objects must
be kept in a separate packfile, especially if the number of objects
in it is smaller than the unpack limit threshold.  In other words, I
view that "guard" as one of the things that blinds us into thinking
that the two packs should be handled the same way.  It is the other
way around: the guard is there only because the code wanted to
handle the two packs the same way.
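
To illustrate the kind of decision I mean (a rough sketch, not what
fetch-pack.c literally does today; the names are made up for
illustration):

    /*
     * A small in-stream packdata gets exploded into loose objects,
     * a large one is kept as a pack.
     */
    static const char *choose_command(int nr_objects, int unpack_limit,
                                      int do_keep)
    {
            if (!do_keep && nr_objects < unpack_limit)
                    return "unpack-objects"; /* leave them loose */
            return "index-pack"; /* keep them as a pack */
    }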

When cloning from a server that offers the bulk of its old history
in a URI packfile plus the rest as in-stream packdata, shouldn't the
result be like cloning from the server back when it had only the
objects in the URI packfile, and then fetching from it again once it
acquired the objects that came in the in-stream packdata?  The
objects that come during the second fetch would be left loose if
there aren't that many, so that the third and subsequent fetches and
local activity can accumulate enough loose objects to be packed into
a single new pack, avoiding an accumulation of too many tiny packs.
And the "guard" breaks that only because this codepath wants to
reuse the unrelated cmd.args to populate index_pack_args.  Isn't
that an artificial limitation that we may want to eventually fix?

When we want to fix that, the "the options are mostly the same when
we use the index-pack command for both packs, so let's copy the
entire command line" approach would come back and haunt us.  The
person doing that fix may be somebody other than you, so it may not
matter to you today, but it will hurt somebody tomorrow.
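
If the two invocations were built independently from the start,
along the lines of the sketch below (using the strvec API; the exact
options shown are only illustrative), there would be nothing to
filter out in the first place:

    #include "strvec.h"

    /*
     * A minimal sketch, not the actual fetch-pack.c code: build the
     * argument list for the index-pack that handles the packfile-URI
     * data on its own, instead of copying cmd.args wholesale.
     */
    static void build_uri_index_pack_args(struct strvec *args,
                                          int from_promisor)
    {
            strvec_push(args, "index-pack");
            strvec_push(args, "--stdin");
            if (from_promisor)
                    strvec_push(args, "--promisor");
            /*
             * Deliberately nothing like --pack_header or --fix-thin
             * here; those only make sense for the in-stream packdata.
             */
    }

That way, whatever the in-stream invocation needs tomorrow would not
leak into the URI invocation by accident.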

I already said that I think 2aec3bc4 (fetch-pack: do not mix
--pack_header and packfile uri, 2021-03-04) is OK as a short-term
fix for the upcoming release, but it does not change the fact that
it is piling technical debt on top of existing technical debt.

And that is why I am reacting rather strongly against your earlier
mention of "filtering out".  That approach continues to "keep a
single args array in the belief that the two must be mostly the
same", which I view as a misguided starting point that must be
rethought.

Thanks.