> > Is this because the gitmodules blob is contained in the base image > > served via the pack URI mechansim, and the "dynamic" packfile for > > the latest part of the history refers to the gitmodules file that is > > unchanged, hence the latter one lacks it? > > That seems like a likely explanation, although this seems ultimately up > to what the pack CDN serves. In this case, yes, that is what is happening. > > You've listed two possible solutions, i.e. > > > > (1) punt and declare that we assume an missing and uncheckable blob > > is OK, > > > > (2) defer the check after transfer completes. > > > > Between the two, my gut feeling is that the latter is preferrable. > > If we assume an missing and uncheckable one is OK, then even if a > > blob is available to be checked, there is not much point in > > checking, no? > > I'm going to second this. If this were a more benign check, then I'd > perhaps feel differently, but .gitmodules fsck checks seem to get > hardened fairly often during security releases, and so it seems > important to keep performing them when the user asked for it. That makes sense. > > As long as the quarantine of incoming pack works correctly, > > streaming the incoming packdata (and packfile downloaded out of line > > via a separate mechanism like pack URI) to index-pack that does not > > check to complete the transfer, with a separate step to check the > > sanity of these packs as a whole, should not harm the repository > > even if it is interrupted in the middle, after transfer is done but > > before checking says it is OK. > > Agreed. Bear in mind that I am pretty unfamiliar with this code, and so > I'm not sure if it's 'easy' or not to change it in this way. The obvious > downside, which Jonathan notes, is that you almost certainly have to > reinflate all of the trees again. > > But, since the user is asking for transfer.fsckObjects explicitly, I > don't think that it's a problem. We might be able to avoid the reinflate if we do it as part of the connectivity check or somehow teach index-pack a way to communicate the dangling .gitmodules links (as you suggest below). > > As a potential third option, I wonder if it is easier for everybody > > involved (including third-party implementation of their > > index-pack/fsck equivalent) if we made it a rule that a pack that > > has a tree that refers to .git<something> must include the blob for > > it? > > Interesting, but I'm sure CDN administrators would prefer to have as few > restrictions in place as possible. That rule would help, but it also seems inelegant in that if we put commits that have the same .gitmodules in 2 or more different packs, there would be identical objects across those packs (besides the reason Taylor mentioned). > A potential fourth option that I can think of is that we can try to > eagerly perform the .gitmodules fsck checks as we receive objects, under > the assumption that the .gitmoudles blob and the tree which contains it > appear in the same pack. > > If they do, then we ought to be able to check them as we currently do > (and avoid leaving them to the slow post-processing step). Any blobs > that we _can't_ find get placed into an array, and then that array is > iterated over after we have received all packs, including from the CDN. > Any blobs that couldn't be found in the pack transferred from the > remote, the CDN, or the local repository (and isn't explicitly excluded > via an object --filter) is declared missing. > > Thoughts? The hard part is communicating this array to the parent fetch process. Stolee has a suggestion [1] which I will reply to directly. [1] https://lore.kernel.org/git/d2ca2fec-a353-787a-15a7-3831a665523e@xxxxxxxxx/