Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> You mentioned --fix-thin (5) and --promisor (8). Why do you think that
> none of these should be given to the "index-pack" that processes the
> packfiles given by URI?

Actually, --fix-thin is probably even worse than that.  As the code
processes the in-stream packdata before processing, or even
downloading, the pregenerated URI packfile, the objects necessary to
fix a "thin" in-stream packdata are likely to be unavailable (it is
exactly the same problem as the one that made us delay the
fsck-objects check done in index-pack when a URI packfile is
involved, isn't it?).  Even if the client asks for --thin, the
server side shouldn't produce a thin pack for the in-stream
packdata, no?

> Perhaps it could be argued that these extra
> packfiles don't need --fix-thin (but I would say that I think servers
> should be allowed to serve thin packfiles through URI too),

I agree that a URI packfile could be thin; after all, the server end
chooses, based on what the client claims to have, which pregenerated
packfile to hand out, so it is perfectly fine to hand out a
pregenerated packfile that is thin, if the client asks for a thin
pack and says it has the base objects missing from that packfile.
And because it is (assumed to be) pregenerated, we can make it a
requirement that no URI packfile depends on objects created later
than that (which means it won't depend on the in-stream packdata).
But we cannot process a thin in-stream packdata if we are to
process it first, right?

> but I think
> that --promisor is necessary (so that a server could, for example,
> offload all trees and commits to a packfile in a CDN, and offload all
> blobs to a separate packfile in a CDN).

Yes, both packfiles conceptually are given by that same server, who
promises to be always available to feed us everything we'd need
later, so both packfiles should be marked as having come from the
same promisor.  So this is one example of an option that happens to
be sharable between the two.  But I do not see it as an indication
that the two packs inherently must be processed with the same
options.

> Looking at this list, I think that all the arguments (except 9, which
> has been fixed) are necessary (or at least useful) for indexing a
> packfile given by URI.

I have to say that this focuses too much on the current need, by
going through how the current code handles the two packs.  Of
course, if we start from a "the two must be the same" viewpoint,
and restrict what the code can do by "guarding" out, based on
"if (index_pack_args)", the bits that require the two to be
different, then the resulting code would invoke the two index-pack
processes the same way.  I am more worried about longer-term code
health, so "currently mostly the same" does not make a convincing
argument for why the two must be processed the same way in the
future.

>> Also, because this loop copies everything in cmd.args, if our
>> in-stream packdata is small, cmd.args.v[0] would be "unpack-objects",
>> and we end up asking the command to explode the (presumably large
>> enough to be worth pre-generating and serving via CDN) packfile that
>> is given via the packfile URI mechanism.
>
> I specifically guard against this through the "if (do_keep ||
> args->from_promisor || index_pack_args || fsck_objects) {" line (which
> is a complicated line, unfortunately).

I am aware of that line that forbids the in-stream packdata from
getting unpacked into loose objects.
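To illustrate what I mean by not sharing a single args array, here
is a minimal sketch, assuming git's strvec API (the function and
variable names, and the decision logic around them, are made up for
illustration; this is not what fetch-pack.c does today):

	#include "strvec.h"

	static void build_pack_processing_args(struct strvec *in_stream_args,
					       struct strvec *uri_pack_args,
					       int small_in_stream_pack,
					       int from_promisor)
	{
		/*
		 * The in-stream packdata may be exploded into loose
		 * objects when small; "small_in_stream_pack" stands
		 * in for whatever the do_keep/unpack_limit decision
		 * ends up being.  It cannot be thin, as it is
		 * processed first, so no --fix-thin here.
		 */
		if (small_in_stream_pack)
			strvec_push(in_stream_args, "unpack-objects");
		else
			strvec_pushl(in_stream_args, "index-pack",
				     "--stdin", NULL);

		/*
		 * The pregenerated URI packfile is always indexed
		 * and kept, and may be thin with respect to what
		 * the client already has, hence --fix-thin.
		 */
		strvec_pushl(uri_pack_args, "index-pack",
			     "--fix-thin", NULL);

		/*
		 * Options that conceptually apply to both packs,
		 * like --promisor, are added to both explicitly.
		 */
		if (from_promisor) {
			strvec_push(in_stream_args, "--promisor");
			strvec_push(uri_pack_args, "--promisor");
		}
	}

With something like that, what each invocation gets is an explicit
decision, not a side effect of copying cmd.args wholesale.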
Coming back to that guard: unless we were told to keep the
resulting pack, or to run fsck-objects via index-pack, I do not see
an inherent reason why the "most recent leftover bits that are not
in the pregenerated pack offloaded to the CDN" objects must be kept
in a separate packfile, especially if the number of objects in it
is smaller than the unpack limit threshold.  In other words, I view
that "guard" as one of the things that blinds us into thinking that
the two packs should be handled the same way.  It is the other way
around---the guard is there only because the code wanted to handle
the two packs the same way.

When cloning from a server that offers the bulk of its old history
in a URI packfile plus an in-stream packfile, shouldn't the result
be like cloning from the server back when it had only the objects
in the URI packfile, and then fetching from it again after it
acquired the objects that came in the in-stream packfile?  The
objects that arrive during that second fetch would be left loose if
there aren't that many of them, so that the third and subsequent
fetches, plus local activity, can accumulate enough loose objects
to be packed into a single new pack, avoiding an accumulation of
too many tiny packs.  The "guard" breaks that, only because this
codepath wants to reuse the unrelated cmd.args to populate
index_pack_args.  Isn't that an artificial limitation that we may
want to eventually fix?  When we do want to fix it, the "the
options are mostly the same when we use the index-pack command for
both packdata, so let's copy the entire command line" reasoning
would come back to haunt us.  The person doing that fix may be
somebody other than you, so it may not matter to you today, but it
will hurt somebody tomorrow.

I already said that I think 2aec3bc4 (fetch-pack: do not mix
--pack_header and packfile uri, 2021-03-04) is OK as a short-term
fix for the upcoming release, but that does not change the fact
that it piles technical debt on top of existing technical debt.
And that is why I am reacting rather strongly against your earlier
mention of "filtering out".  That approach continues the "keep the
single args array in the belief that the two must be mostly the
same" pattern, which I view as a misguided starting point that must
be rethought.

Thanks.