Re: [PATCH v5 00/17] cruft packs

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 24 May 2022 17:50:25 -0400

On Tue, May 24, 2022 at 12:39:11PM -0700, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
> > On Fri, May 20 2022, Taylor Blau wrote:
> > 	On Mon, Nov 29 2021, Taylor Blau wrote:
>
> > 	> +static void write_mtimes_header(struct hashfile *f)
> > 	> +{
> > 	> +	hashwrite_be32(f, MTIMES_SIGNATURE);
> > 	> +	hashwrite_be32(f, MTIMES_VERSION);
> > 	> +	hashwrite_be32(f, oid_version(the_hash_algo));
> > 	> +}
> [...]
> > 	But since this is a new format I think it's worth considering not using
> > 	the 1 or 2 you get from oid_version(), but the "format_id",
> > 	i.e. GIT_SHA1_FORMAT_ID or GIT_SHA256_FORMAT_ID.
> >
> > 	You'll use the same space in the format for it, but we'll end up with
> > 	something more obvious (as the integer encodes the sha1 or sha256 name).
>
> Agreed.

I know we recommend using the format_id for on-disk formats, but I think
there is enough existing uses of "1" or "2" that either are acceptable
in practice.

E.g., grepping around for "hashwrite.*oid_version", there are three
existing formats that use "1" or "2" instead of the format_id. They are:

  - the commit-graph format
  - the midx format
  - the .rev format

Moreover, I can't seem to find any formats that _don't_ use that
convention. So I have a vague preference towards using the values "1"
and "2" as we currently do in these patches. (TBH, I don't find "sha1"
significantly more interpretable than just "1", so I would be just as
happy leaving it as-is).

> [...]
> > Other than that the only question I have (I think) on this series is if
> > Jonathan Nieder is happy with it. I looked back in my logs and there was
> > an extensive on-IRC discussion about it at the end of March, which ended
> > in you sending: https://lore.kernel.org/git/YkICkpttOujOKeT3@nand.local/
> >
> > But it seems Jonathan didn't chime in since then, and he had some major
> > issues with the approach here. I think those should have been addressed
> > by that discussion, but it would be nice to get a confirmation.
>
> I would still prefer if this used a repository format extension, but
> that preference is not strong enough that I'd say "this must not go in
> without one".  What I think would help would be some information in
> the user-facing documentation for commands that create and work with
> cruft packs.  In other words, if our take on people sharing
> repositories between implementations that understand and don't
> understand cruft packs and get objects moving back and forth between
> packed and loose objects is "you should have known you were doing
> something strange", the least we can do is to warn them.

I think that's a good suggestion. We already have some documentation in
Documentation/technical/cruft-packs.txt, but I think it could be helpful
to add user-facing documentation, too.

Would you be opposed to doing that outside of this series? ISTM that the
technical discussion has mostly settled, so I'd rather wordsmith the
user-facing documentation separately.

> I don't see a config to enable PACK_CRUFT by default yet in this
> series.  I'd like one, so that people can turn it on and get the good
> new behavior. :)

`git gc` has support for this (c.f., "gc.cruftPacks"). `git repack`
requires you to pass `--cruft`; IIRC I originally had a similar
configuration in `git repack` which would change the behavior of `-A` /
`-a` when set, but I found it too confusing and scrapped it.

Thanks,
Taylor