Re: Submodules and SHA-256/SHA-1 interoperability

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 1 Mar 2021 20:28:13 +0100 (CET)

Hi brian,

On Sun, 14 Feb 2021, brian m. carlson wrote:

> I'm currently working on the next step of the SHA-256 transition code,
> which is SHA-256/SHA-1 interoperability.  Essentially, when we write a
> loose object into the store, or when we index a pack, we take one form
> of the object, usually the SHA-256 form, and rewrite it so that it is in
> its SHA-1 form, and then hash it to determine its SHA-1 name.  We then
> write this correspondence either into the loose object index (for loose
> objects) or a v3 index (for packs).
>
> Blobs are simply hashed with both algorithms, but trees, commits, and
> tags need to be rewritten to use the SHA-1 names of the objects they
> refer to.  For most situations, we already have this data, since it will
> exist in the loose object index, in some pack index, or elsewhere in the
> pack we're indexing.
>
> However, for submodules, we have a problem.  By definition, the object
> exists in a different repository.  If we have the submodule locally on
> the system, then this works fine, but if we're performing a fetch or
> clone and the submodule is not present, then we cannot rewrite the tree
> or anything that refers to it, directly or indirectly.
>
> So there are some possible courses of action:
>
> * Disallow compatibility algorithms when using submodules.  This is
>   simple, but inconvenient.
> * Force users to always clone submodules and fetch them before fetching
>   the main repository.  This is also relatively simple, but
>   inconvenient.
> * Have the remote server keep a list of correspondences and send them in
>   a protocol extension.
> * Just skip rewriting objects until the data is filled in later and
>   admit the data will be incomplete.  This means that pushing to or
>   pulling from a repository using an incompatible algorithm will be
>   impossible.
> * Something else I haven't thought of.
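
For readers following along, the rewriting step described above can be sketched roughly as follows. This is only an illustration, not Git's actual code: the function names are made up for this mail, tree entries are assumed to be already sorted, and the translation table is a plain dict standing in for the loose object index or a pack index. It shows that a blob can simply be hashed twice, whereas a tree cannot be rewritten when one of its entries (for example a gitlink with mode 160000) has no known SHA-1 counterpart.

```python
import hashlib


def object_name(obj_type, body, algo):
    """Hash '<type> <size>\\0<body>' with the given algorithm (hex digest)."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(algo, header + body).hexdigest()


def blob_names(data):
    """Blobs need no rewriting: hash the same bytes with both algorithms."""
    return {algo: object_name("blob", data, algo) for algo in ("sha1", "sha256")}


def tree_body(entries, translate=None):
    """Serialize tree entries given as (mode, name, hex object name).

    If `translate` is given (a hypothetical {sha256_hex: sha1_hex} table),
    every entry is rewritten to its SHA-1 name first. A missing entry raises
    KeyError, which is the situation for a submodule that is not present.
    """
    out = b""
    for mode, name, oid in entries:
        if translate is not None:
            oid = translate[oid]  # KeyError if the correspondence is unknown
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return out


def tree_sha1_name(entries, translate):
    """SHA-1 name of a tree whose entries are given by their SHA-256 names."""
    return object_name("tree", tree_body(entries, translate), "sha1")
```

Calling `tree_sha1_name()` on a tree that contains a `160000` entry fails with `KeyError` until the submodule commit's SHA-1 name has been added to the table, which is exactly the gap the options above try to close.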

While my strong urge is to add "Remove support for submodules" (which BTW
would also plug so many attack vectors that have led to many a
vulnerability in the past), I understand that this would be impractical:
the figurative barn door has been open for way too long to do that.

But I'd like to put another idea into the fray: store the mapping in
`.gitmodules`. That is, each time `git submodule add <...>` is called, it
would update `.gitmodules` to list SHA-1 *and* SHA-256 for the given path.

That would relieve us of the problem where we rely on a server's ability
to give us that mapping.
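
To make the idea a bit more concrete, here is a rough sketch of what such an entry could look like and how the mapping could be read back. Please take it with a grain of salt: the `sha1` and `sha256` keys are purely hypothetical names picked for this mail (nothing like them exists in `.gitmodules` today), and the parser is a toy that only understands simple `key = value` lines, not the full config syntax.

```python
def render_stanza(name, path, url, sha1, sha256):
    """Render one .gitmodules stanza with two HYPOTHETICAL hash keys."""
    return (
        f'[submodule "{name}"]\n'
        f"\tpath = {path}\n"
        f"\turl = {url}\n"
        f"\tsha1 = {sha1}\n"      # hypothetical key, not an existing setting
        f"\tsha256 = {sha256}\n"  # hypothetical key, not an existing setting
    )


def parse_mapping(text):
    """Return {sha256: sha1} for every stanza that carries both keys."""
    mapping = {}
    current = {}

    def flush():
        # Record the pair of the stanza we just finished reading, if complete.
        if "sha1" in current and "sha256" in current:
            mapping[current["sha256"]] = current["sha1"]

    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            flush()
            current.clear()
        elif "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    flush()
    return mapping
```

With something along these lines, the SHA-1 name of a submodule commit could be looked up from the superproject's own tracked file while rewriting a tree, without asking any server.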

Ciao,
Dscho

> The third option is where I'm leaning, but it has some potential
> downsides.  First, the server must support both hash algorithms and have
> this data.  Second, it essentially requires all submodule updates to be
> pushed from a compatible client.  Third, we need to trust that the
> server hasn't tampered with the data, which should be possible by doing
> an fsck on both forms (I think).  Fourth, we need to store this
> somewhere, and the only place we have right now is the loose object
> index, which would potentially grow to inefficient sizes.
>
> We could potentially change this to be slightly different by asking the
> submodule server for a list of correspondences instead via a new
> protocol extension, but it has the same downsides except for the second
> one, and additionally means that we'd need to make multiple connections.
>
> So I'm seeking some ideas on which approach we want to use here before
> I start sinking a lot of work into this.
> --
> brian m. carlson (he/him or they/them)
> Houston, Texas, US
>