Re: RFC on packfile URIs and .gitmodules check

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> Someone at $DAYJOB noticed that if a .gitmodules-containing tree and the
> .gitmodules blob itself are sent in 2 separate packfiles during a fetch
> (which can happen when packfile URIs are used), transfer.fsckobjects
> causes the fetch to fail. You can reproduce it as follows (as of the
> time of writing):
>
>   $ git -c fetch.uriprotocols=https -c transfer.fsckobjects=true clone https://chromium.googlesource.com/chromiumos/codesearch
>   Cloning into 'codesearch'...
>   remote: Total 2242 (delta 0), reused 2242 (delta 0)
>   Receiving objects: 100% (2242/2242), 1.77 MiB | 4.62 MiB/s, done.
>   error: object 1f155c20935ee1154a813a814f03ef2b3976680f: gitmodulesMissing: unable to read .gitmodules blob
>   fatal: fsck error in pack objects
>   fatal: index-pack failed
>
> This happens because the fsck part is currently being done in
> index-pack, which operates on one pack at a time. When index-pack sees
> the tree, it runs fsck on it (like any other object), and the fsck
> subsystem remembers the .gitmodules target (specifically, in
> gitmodules_found in fsck.c). Later, index-pack runs fsck_finish() which
> checks if the target exists, but it doesn't, so it reports the failure.

Is this because the gitmodules blob is contained in the base image
served via the pack URI mechansim, and the "dynamic" packfile for
the latest part of the history refers to the gitmodules file that is
unchanged, hence the latter one lacks it?

> Another option is to do it within the connectivity check instead - so,
> update rev-list and the object walking mechanism to be able to detect
> .gitmodules in trees and fsck the target blob whenever such an entry
> occurs. This has the advantage that there is no extra re-inflation,
> although it might be strange to have object walking be able to fsck.
>
> The simplest solution would be to just relax this - check the blob if it
> exists, but if it doesn't, it's OK. Some things in favor of this
> solution:
>
>  - This is something we already do in the partial clone case (although
>    it could be argued that in this case, we're already trusting the
>    server for far more than .gitmodules, so just because it's OK in the
>    partial clone case doesn't mean that it's OK in the regular case).
>
>  - Also, the commit message for this feature (from ed8b10f631 ("fsck: check
>    .gitmodules content", 2018-05-21)) gives a rationale of a newer
>    server being able to protect older clients.
>     - Servers using receive-pack (instead of fetch-pack) to obtain
>       objects would still be protected, since receive-pack still only
>       accepts one packfile at a time (and there are currently no plans
>       to expand this).
>     - Also, malicious .gitobjects files could still be crafted that pass
>       fsck checking - for example, by containing a URL (of another
>       server) that refers to a repo with a .gitobjects that would fail
>       fsck.
>
> So I would rather go with just relaxing the check, but if consensus is
> that we should still do it, I'll investigate doing it in the
> connectivity check.

You've listed two possible solutions, i.e.

 (1) punt and declare that we assume an missing and uncheckable blob
     is OK,

 (2) defer the check after transfer completes.

Between the two, my gut feeling is that the latter is preferrable.
If we assume an missing and uncheckable one is OK, then even if a
blob is available to be checked, there is not much point in
checking, no?

As long as the quarantine of incoming pack works correctly,
streaming the incoming packdata (and packfile downloaded out of line
via a separate mechanism like pack URI) to index-pack that does not
check to complete the transfer, with a separate step to check the
sanity of these packs as a whole, should not harm the repository
even if it is interrupted in the middle, after transfer is done but
before checking says it is OK.

As a potential third option, I wonder if it is easier for everybody
involved (including third-party implementation of their
index-pack/fsck equivalent) if we made it a rule that a pack that
has a tree that refers to .git<something> must include the blob for
it?

Thanks.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux