Re: Questions about the hash function transition

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 28 Aug 2018 17:02:18 +0200

On Tue, Aug 28 2018, Edward Thomson wrote:

> On Tue, Aug 28, 2018 at 2:50 PM, Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>> If we instead had something like clean/smudge filters:
>>
>>     [extensions]
>>         objectFilter = sha256-to-sha1
>>         compatObjectFormat = sha1
>>     [objectFilter "sha256-to-sha1"]
>>         clean  = ...
>>         smudge = ...
>>
>> We could apply arbitrary transformations on objects through filters
>> which would accept/return some simple format requesting them to
>> translate such-and-such objects, and would either return object
>> names/types under which to store them, or "nothing to do".
>
> If I'm understanding you correctly, then on the libgit2 side, I'm very much
> opposed to this proposal.  We never execute commands, nor do I want to start
> thinking that we can do so arbitrarily.  We run in environments where that's
> a non-starter

I was being unclear. I'm suggesting that we slightly amend the syntax of
what we're proposing to put in .git/config, to leave the door open for
*optionally* doing arbitrary mappings.

It would still work exactly the same internally for the common
sha1<->sha256 case, i.e. neither git, libgit2, jgit nor anyone else
would need to shell out to anything.

They'd just pick up that common case and handle it internally, similar
to how e.g. the crlf filter (vs. full clean/smudge support) works in
git & libgit2:
https://github.com/libgit2/libgit2/blob/master/tests/filter/crlf.c

So the sha256<->sha1 support would be an implicit built-in like crlf.
The config syntax would just leave the door open to having something
like git-lfs for object translation later.
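
To make the built-in vs. external split concrete, here is a rough
Python sketch of how an implementation might resolve the configured
filter name. Everything in it is made up for illustration
(BUILTIN_FILTERS, resolve_filter, allow_external); none of it is an
existing git, libgit2 or jgit API. The point is only that a library
which never executes commands can still handle the common case and
refuse anything else:

```python
# Hypothetical set of translations an implementation handles natively,
# the way "crlf" is handled without running any external program.
BUILTIN_FILTERS = {"sha256-to-sha1"}


class ExternalFilterUnsupported(Exception):
    """Raised when a non-built-in filter is configured but not allowed."""


def resolve_filter(name, allow_external=False):
    """Classify a configured objectFilter name.

    Returns ("builtin", name) for translations handled internally, and
    ("external", name) only if the caller opted in to external filters.
    """
    if name in BUILTIN_FILTERS:
        return ("builtin", name)
    if not allow_external:
        # An implementation that never shells out can simply refuse here.
        raise ExternalFilterUnsupported(name)
    return ("external", name)
```

An implementation that never shells out would leave allow_external off
and still support the sha256<->sha1 case.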

Now what does that really mean? And I admit I may be missing something
here.

Unlike smudge/clean filters, we're going to be constrained to hashes of
20 or 32 bytes, locally & remotely, since we wouldn't want to support
arbitrary lengths. But with relatively small changes we could go from
supporting just:

    # local  remote
    sha256<->sha1

To also support:

    # local  remote
    fn(sha1)<->fn(sha1)
    fn(sha1)<->fn(sha256)
    fn(sha256)<->fn(sha1)
    fn(sha256)<->fn(sha256)

Where fn() is some hook you'd provide to hook into the bits where we're
e.g. unpacking SHA-1 objects from the remote, and writing them locally
as SHA-256, except instead of (as we do by default) writing:

    SHA256_map(sha256(content)) = content

You'd write:

    SHA256_map(sha256(fn(content))) = fn(content)

Where fn() would need to be idempotent.
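
As a rough Python sketch of the two cases above: the object map is a
plain dict, and store() and drop_carriage_returns() are hypothetical
names of mine, not anything in git. It also ignores the "<type>
<size>\0" header that git hashes along with the content, so the names
it produces are not real git object names:

```python
import hashlib


def store(object_map, content, fn=lambda data: data):
    """Store fn(content) under sha256(fn(content)).

    With the default identity fn this is the plain
    SHA256_map(sha256(content)) = content case.
    """
    transformed = fn(content)
    # fn() must be idempotent: applying it to its own output has to be
    # a no-op, otherwise re-translating an object would rename it.
    if fn(transformed) != transformed:
        raise ValueError("fn() is not idempotent for this content")
    name = hashlib.sha256(transformed).hexdigest()
    object_map[name] = transformed
    return name


def drop_carriage_returns(data):
    """Example idempotent fn(): remove every CR byte."""
    return data.replace(b"\r", b"")
```

With fn=drop_carriage_returns, b"a\r\nb\r\n" and b"a\nb\n" end up
stored as the same object under the same name.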

Now, why is this useful or worth considering? As noted in the e-mail I
linked to, it allows for some novel use cases for doing local to remote
object translation.

But really, I'm not suggesting that *that* is something we should
consider. *All* I'm saying is that given the experience of how we
started out with stuff like the built-in "crlf", and then grew
smudge/clean filters, it's worth considering what sort of .git/config
key-value pairs we'd pick that would lend themselves to such future
extensions, should we deem that to be a good idea in the future.

Because if we never want such extensions we've lost nothing by picking
an extensible syntax, but if we do want them and didn't plan for it,
we'd need to support two sets of config syntaxes to do those two
related things.

> At present, in libgit2, users can provide their own mechanism for running
> clean/smudge filters.  But hash transformation / compatibility is going to
> be a crucial compatibility component.  So this is not something that we
> could simply opt out of or require users to implement themselves.

Indeed.