Re: [PATCH 34/44] builtin/ls-remote: initialize repository based on fetch

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Sat, 16 May 2020 20:28:55 +0000

On 2020-05-16 at 11:16:46, Martin Ågren wrote:
> On Wed, 13 May 2020 at 02:58, brian m. carlson
> <sandals@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > ls-remote may or may not operate within a repository, and as such will
> > not have been initialized with the repository's hash algorithm.  Even if
> > it were, the remote side could be using a different algorithm and we
> > would still want to display those refs properly.  Find the hash
> > algorithm used by the remote side by querying the transport object and
> > set our hash algorithm accordingly.
> >
> > Without this change, if the remote side is using SHA-256, we truncate
> > the refs to 40 hex characters, since that's the length of the default
> > hash algorithm (SHA-1).
> 
> Could we add a test that passes now but would have failed before?

The existing tests that call "git ls-remote" actually fail with SHA-256
if we don't do this, specifically "ls-remote works outside repository"
in t5512.  That's the thing with a lot of this series: our existing test
suite is enormously effective at catching these things, but writing a
new test is hard because we can't actually instantiate a SHA-256
repository (because then users could, and it's broken until the end of
the series).  Perhaps unsurprisingly, that's how I found this problem.

So while I would love to write a test for this case, I can't without
allowing users to corrupt and destroy their data in the meantime (or
tacking the final six commits onto this series).
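
For illustration only (this is not the patch and not Git's real code): a
toy C model of the failure mode described in the commit message.  All
names prefixed with toy_ are hypothetical.  It assumes a process-wide
"current algorithm" that defaults to SHA-1 and a formatter that emits
only as many hex characters as that algorithm calls for, so a 32-byte
SHA-256 object ID is cut to 40 hex characters until the algorithm is
switched to the one the remote side uses.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, not Git's actual definitions. */
#define TOY_MAX_RAWSZ 32

struct toy_hash_algo {
	const char *name;
	size_t rawsz;	/* bytes in a binary object ID */
	size_t hexsz;	/* characters in its hex form (2 * rawsz) */
};

static const struct toy_hash_algo toy_algos[] = {
	{ "sha1",   20, 40 },
	{ "sha256", 32, 64 },
};

/* Process-wide "current algorithm"; defaults to SHA-1. */
static const struct toy_hash_algo *toy_current_algo = &toy_algos[0];

/* Switch the process-wide algorithm (0 = sha1, 1 = sha256). */
static void toy_set_hash_algo(int idx)
{
	toy_current_algo = &toy_algos[idx];
}

/*
 * Format an object ID as hex.  Only the number of bytes the current
 * algorithm specifies is formatted, so a longer ID is silently
 * truncated when the wrong algorithm is in effect.
 */
static const char *toy_oid_to_hex(const unsigned char *raw)
{
	static char buf[2 * TOY_MAX_RAWSZ + 1];
	static const char hex[] = "0123456789abcdef";
	size_t i;

	for (i = 0; i < toy_current_algo->rawsz; i++) {
		buf[2 * i] = hex[raw[i] >> 4];
		buf[2 * i + 1] = hex[raw[i] & 0xf];
	}
	buf[2 * i] = '\0';
	return buf;
}
```

In this model, formatting a 32-byte ID before calling
toy_set_hash_algo(1) yields a 40-character string, and a 64-character
string afterwards.  The global switch is also what makes the approach a
wart: everything else in the process sees the new algorithm too.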

> >         ref = transport_get_remote_refs(transport, &ref_prefixes);
> > +       if (ref) {
> > +               int hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
> > +               repo_set_hash_algo(the_repository, hash_algo);
> > +       }
> 
> This will modify `the_hash_algo`. Quoting commit 78a6766802 ("Integrate
> hash algorithm support with repo setup", 2017-11-12):
> 
>   Add a constant, the_hash_algo, which points to the hash_algo structure
>   pointer in the repository global.  Note that this is the hash which is
>   used to serialize data to disk, not the hash which is used to display
>   items to the user.  The transition plan anticipates that these may be
>   different.  We can add an additional element in the future (say,
>   ui_hash_algo) to provide for this case.
> 
> Don't we violate that here? Is it mostly luck that we can go on to list
> what we want to list and that we will never write to disk based on
> `the_hash_algo` being "wrong"(?)? Or am I missing something?

We do violate that and we also rely on it never having any effect on our
current repository.  Unfortunately, as things stand now, we don't
support multiple hash algorithms in the same running binary, and we
can't until we allow a member of struct object_id to vary based on the
hash algorithm.  That work is coming in a future series (after we have a
fully functioning SHA-256 stage 4 implementation), but at this point,
I'm still working through all of the crashes we get from random places
where we make assumptions about initializing things, so it's not a
straightforward fix.

For now, I think this is the best we can do without major additional
surgery to the codebase.  I'm fine with stating that git ls-remote can
read the repository (to parse remotes) but can't write to it, since
that's the behavior users will expect anyway.  I'll update the commit
message to reflect that wart and assumption, since it would be good to
document it.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204