Re: RFC on packfile URIs and .gitmodules check

On Fri, Jan 15, 2021 at 04:30:07PM -0800, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>
> > Someone at $DAYJOB noticed that if a .gitmodules-containing tree and the
> > .gitmodules blob itself are sent in 2 separate packfiles during a fetch
> > (which can happen when packfile URIs are used), transfer.fsckobjects
> > causes the fetch to fail. You can reproduce it as follows (as of the
> > time of writing):
> >
> >   $ git -c fetch.uriprotocols=https -c transfer.fsckobjects=true clone https://chromium.googlesource.com/chromiumos/codesearch
> >   Cloning into 'codesearch'...
> >   remote: Total 2242 (delta 0), reused 2242 (delta 0)
> >   Receiving objects: 100% (2242/2242), 1.77 MiB | 4.62 MiB/s, done.
> >   error: object 1f155c20935ee1154a813a814f03ef2b3976680f: gitmodulesMissing: unable to read .gitmodules blob
> >   fatal: fsck error in pack objects
> >   fatal: index-pack failed
> >
> > This happens because the fsck part is currently being done in
> > index-pack, which operates on one pack at a time. When index-pack sees
> > the tree, it runs fsck on it (like any other object), and the fsck
> > subsystem remembers the .gitmodules target (specifically, in
> > gitmodules_found in fsck.c). Later, index-pack runs fsck_finish() which
> > checks if the target exists, but it doesn't, so it reports the failure.
>
> Is this because the gitmodules blob is contained in the base image
> served via the pack URI mechanism, and the "dynamic" packfile for
> the latest part of the history refers to the gitmodules file that is
> unchanged, hence the latter one lacks it?

That seems like a likely explanation, although ultimately it is up to
what the pack CDN chooses to serve.
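
To make that failure mode concrete, here's a toy, self-contained model
of the per-pack check described above. All of the names below are
hypothetical stand-ins; the real logic is the gitmodules_found set in
fsck.c and fsck_finish(), which index-pack calls once per pack:

  #include <stdio.h>
  #include <string.h>

  /*
   * Toy model of the per-pack check; all names are hypothetical
   * stand-ins for the gitmodules_found set in fsck.c and
   * fsck_finish(), which index-pack runs once per pack.
   */
  #define MAX 16

  static const char *in_this_pack[MAX];  /* objects from the current pack */
  static size_t nr_objects;
  static const char *targets[MAX];       /* .gitmodules blobs named by trees */
  static size_t nr_targets;

  static void scan_tree_entry(const char *name, const char *oid)
  {
          /* the tree walk remembers the target of any .gitmodules entry */
          if (!strcmp(name, ".gitmodules"))
                  targets[nr_targets++] = oid;
  }

  static int fsck_finish_this_pack(void)
  {
          /* every remembered target must be readable at end-of-pack */
          for (size_t i = 0; i < nr_targets; i++) {
                  int found = 0;
                  for (size_t j = 0; j < nr_objects; j++)
                          if (!strcmp(targets[i], in_this_pack[j]))
                                  found = 1;
                  if (!found) {
                          fprintf(stderr, "gitmodulesMissing: unable to "
                                  "read .gitmodules blob %s\n", targets[i]);
                          return -1;
                  }
          }
          return 0;
  }

  int main(void)
  {
          /* the tree arrives in the inline pack... */
          scan_tree_entry(".gitmodules", "1f155c209...");
          /* ...but the blob was shipped in the packfile-URI pack, which
           * this index-pack invocation never saw, so the check fails */
          return fsck_finish_this_pack() ? 1 : 0;
  }
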
> You've listed two possible solutions, i.e.
>
>  (1) punt and declare that we assume a missing and uncheckable blob
>      is OK,
>
>  (2) defer the check after transfer completes.
>
> Between the two, my gut feeling is that the latter is preferable.
> If we assume a missing and uncheckable one is OK, then even if a
> blob is available to be checked, there is not much point in
> checking, no?

I'm going to second this. If this were a more benign check, then I'd
perhaps feel differently, but the .gitmodules fsck checks seem to get
hardened fairly often during security releases, so it seems important
to keep performing them when the user has asked for them.

> As long as the quarantine of incoming pack works correctly,
> streaming the incoming packdata (and packfile downloaded out of line
> via a separate mechanism like pack URI) to index-pack that does not
> check to complete the transfer, with a separate step to check the
> sanity of these packs as a whole, should not harm the repository
> even if it is interrupted in the middle, after transfer is done but
> before checking says it is OK.

Agreed. Bear in mind that I am pretty unfamiliar with this code, so
I'm not sure how easy it would be to change it in this way. The
obvious downside, which Jonathan notes, is that you almost certainly
have to inflate all of the trees again.

But since the user is asking for transfer.fsckObjects explicitly, I
don't think that's a problem.
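
For what it's worth, here's how I picture option (2) as a toy model
(hypothetical names throughout, not the real index-pack code): every
pack, inline or fetched via a packfile URI, is indexed into the
quarantine first, and the collected .gitmodules targets are verified
in a single pass before anything migrates out:

  #include <stdio.h>
  #include <string.h>

  /*
   * Toy model of the deferred check (hypothetical names; not the
   * real index-pack code): everything lands in the quarantine first,
   * and the collected .gitmodules targets are verified in one pass.
   */
  #define MAX 16

  static const char *quarantine[MAX];    /* objects from *all* packs */
  static size_t nr_quarantined;
  static const char *targets[MAX];       /* .gitmodules blobs named by trees */
  static size_t nr_targets;

  static int in_quarantine(const char *oid)
  {
          for (size_t i = 0; i < nr_quarantined; i++)
                  if (!strcmp(quarantine[i], oid))
                          return 1;
          return 0;
  }

  int main(void)
  {
          /* the pack from the CDN delivers the blob... */
          quarantine[nr_quarantined++] = "1f155c209...";
          /* ...and the inline pack's tree names it as a target */
          targets[nr_targets++] = "1f155c209...";

          /* one pass after the whole transfer: the blob is found */
          for (size_t i = 0; i < nr_targets; i++)
                  if (!in_quarantine(targets[i])) {
                          fprintf(stderr, "gitmodulesMissing: %s\n",
                                  targets[i]);
                          return 1;
                  }
          puts("fsck ok; safe to migrate objects out of quarantine");
          return 0;
  }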

> As a potential third option, I wonder if it is easier for everybody
> involved (including third-party implementations of their
> index-pack/fsck equivalents) if we made it a rule that a pack that
> has a tree that refers to .git<something> must include the blob for
> it?

Interesting, but I'm sure CDN administrators would prefer to have as few
restrictions in place as possible.

A potential fourth option that I can think of is to eagerly perform
the .gitmodules fsck checks as we receive objects, under the
assumption that the .gitmodules blob and the tree which refers to it
appear in the same pack.

If they do, then we ought to be able to check them as we currently do
(and avoid leaving them to the slow post-processing step). Any blobs
that we _can't_ find get placed into an array, and that array is
iterated over after we have received all packs, including those from
the CDN. Any blob that still can't be found in the pack transferred
from the remote, the packs fetched from the CDN, or the local
repository (and isn't explicitly excluded via an object filter, i.e.,
'--filter') is declared missing.
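
As a toy model (hypothetical names again, just to sketch the shape of
the idea, not a real implementation):

  #include <stdio.h>
  #include <string.h>

  /*
   * Toy model of the eager-check-with-fallback idea (hypothetical
   * names; only meant to illustrate the shape of the idea).
   */
  #define MAX 16

  static const char *seen[MAX];          /* objects indexed so far, any pack */
  static size_t nr_seen;
  static const char *pending[MAX];       /* targets we couldn't resolve yet */
  static size_t nr_pending;

  static int have_object(const char *oid)
  {
          for (size_t i = 0; i < nr_seen; i++)
                  if (!strcmp(seen[i], oid))
                          return 1;
          return 0;
  }

  static void check_gitmodules(const char *oid)
  {
          if (have_object(oid)) {
                  /* eager path: run the usual .gitmodules checks now */
                  return;
          }
          pending[nr_pending++] = oid;   /* retry in the final pass */
  }

  int main(void)
  {
          /* the inline pack's tree names the blob before we have it */
          check_gitmodules("1f155c209...");
          /* the CDN pack then delivers the blob itself */
          seen[nr_seen++] = "1f155c209...";

          /* final pass over the leftovers, after all packs are in;
           * anything still unresolved (and not excluded by an object
           * filter) is declared missing */
          for (size_t i = 0; i < nr_pending; i++)
                  if (!have_object(pending[i])) {
                          fprintf(stderr, "gitmodulesMissing: %s\n",
                                  pending[i]);
                          return 1;
                  }
          puts("all .gitmodules blobs accounted for");
          return 0;
  }

The nice property is that the common case (blob and tree in the same
pack) is still checked eagerly and pays nothing extra; only the
stragglers hit the final pass.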

Thoughts?

> Thanks.

Thanks,
Taylor


