Re: [PATCH 36/44] builtin/index-pack: add option to specify hash algorithm

Martin Ågren <martin.agren@xxxxxxxxx> · Sun, 17 May 2020 20:16:37 +0200

On Sat, 16 May 2020 at 22:47, brian m. carlson
<sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 2020-05-16 at 11:18:12, Martin Ågren wrote:
> > On Wed, 13 May 2020 at 02:56, brian m. carlson
> > <sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > git index-pack is usually run in a repository, but need not be. Since
> > > packs don't contain information on the algorithm in use, instead
> > > relying on context, add an option to index-pack to tell it which one
> > > we're using in case someone runs it outside of a repository.
>
> > Similar to an earlier patch where we modify `the_hash_algo` like this, I
> > feel a bit nervous. What happens if you pass in a "wrong" algo here,
> > i.e., SHA-1 in a SHA-256 repo? Or, given the motivation in the commit
> > message, should this only be allowed if we really *are* outside a repo?
>
> Unfortunately, we can't prevent the user from being inside repository A,
> which is SHA-1, while invoking git index-pack on repository B, which is
> SHA-256.

Ah, I see.

>  That is valid without --stdin, if uncommon, and it needs to be
> supported.  I can prevent it from being used with --stdin, though.

Hmm, that might make sense. I suppose it could quickly get out of
control with bug reports coming in along the lines of "if I do this
really crazy git index-pack invocation, I manage to mess things up". The
easiest way to address this might be through documentation, i.e., "don't
use this option", "for internal use" or even "to be used by the test
suite only", for which there is even precedent in git-index-pack(1).

On the other hand, if we need to detect such a hash mismatch even once
the SHA-256 work is 100% complete, then I suppose we really should make
some effort to catch bad invocations.

As a tangent, I see that v2.27.0 will come with `git init
--object-format=<format>` and `GIT_DEFAULT_HASH_ALGORITHM`. The docs for
the former mention "(if enabled)". Should we add something scarier
to those to make it clear that they shouldn't be used and that you
basically shouldn't even try to figure out how to enable them? I can
already see the tweets and blog posts a few weeks from now about how you
can build Git from source setting a single switch, run

  git init --object-format=sha256

and you're in the future! Which will just lead to pain some days or
weeks later.... "I've done lots of work. How do I convert my repo to
SHA-1 so I can share it?"...

We've added "experimental" things before and tried to document the
experimental nature. Maybe here we're not even "experimental" -- more
like "if you use this in production, you *will* suffer"?

> If you pass in a wrong algorithm, we usually blow up with an inflate
> error because we consume more bytes than expected with our ref deltas.
> I'm not aware of any cases where we segfault or access invalid memory;
> we just blow up in a nonobvious way.  That's true, too, if you manually
> tamper with the algorithm in extensions.objectformat; usually we blow up
> (but not segfault) because the index is "corrupt".

Ok, I see. I suppose at some point we could tweak error messages to hint
about an object-format mismatch, but I don't think that needs to block
your work here now.
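
To convince myself of the failure mode, here is a rough Python sketch
(not Git code). It assumes a heavily simplified ref-delta entry: the raw
base object ID followed directly by zlib data, with the real type/size
header left out. The function names and the payload are made up for
illustration. Reading a SHA-1 entry with the SHA-256 ID length eats the
first 12 bytes of the zlib stream, so inflating what is left goes wrong.

```python
import zlib

SHA1_LEN = 20    # bytes in a raw SHA-1 object ID
SHA256_LEN = 32  # bytes in a raw SHA-256 object ID


def write_ref_delta(base_id: bytes, delta: bytes) -> bytes:
    """Build a simplified ref-delta entry: raw base ID, then zlib data.

    Real pack entries also carry a type/size header, omitted here.
    """
    return base_id + zlib.compress(delta)


def read_ref_delta(entry: bytes, hash_len: int) -> tuple:
    """Split off hash_len bytes as the base ID and inflate the rest."""
    base_id, compressed = entry[:hash_len], entry[hash_len:]
    return base_id, zlib.decompress(compressed)


def describe(entry: bytes, hash_len: int) -> str:
    """Report whether the entry inflates cleanly for a given ID length."""
    try:
        _, delta = read_ref_delta(entry, hash_len)
    except zlib.error as err:
        return f"inflate error: {err}"
    return f"inflated {len(delta)} bytes"


if __name__ == "__main__":
    # Hypothetical payload and a stand-in 20-byte base ID.
    delta = b"hypothetical delta instructions " * 4
    entry = write_ref_delta(bytes(range(SHA1_LEN)), delta)

    print("read as SHA-1:  ", describe(entry, SHA1_LEN))
    print("read as SHA-256:", describe(entry, SHA256_LEN))
```

Nothing in the entry itself says which length is right, which matches
the commit message's point about packs relying on context.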

Martin