Re: [PATCH 20/20] abbrev: add a core.validateAbbrev setting

Martin Ågren <martin.agren@xxxxxxxxx> · Sat, 9 Jun 2018 17:47:18 +0200

On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:

> Instead of trying really hard to find an unambiguous SHA-1 we can with
> core.validateAbbrev=false, and preferably combined with the new
> support for relative core.abbrev values (such as +4) unconditionally
> print a short SHA-1 without doing any disambiguation check. I.e. it

This first paragraph read weirdly the first time. On the second attempt
I knew how to structure it and got it right. It might be easier to read
if the part about core.abbrev=+4 were in a separate second sentence.

That last "it" is "the combination of these two configs", right?

> allows for picking a trade-off between performance, and the odds that
> future or remote (or current and local) short SHA-1 will be ambiguous.

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index abf07be7b6..df31d1351f 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 that Git would
>  otherwise print, this allows for producing more future-proof SHA-1s
>  for use within a given project, while adjusting the value for the
>  current approximate number of objects.
> ++
> +This is especially useful in combination with the
> +`core.validateAbbrev` setting, or to get more future-proof hashes to
> +reference in the future in a repository whose number of objects is
> +expected to grow.

Maybe s/validateAbbrev/validateAbbrev = false/?

> +
> +core.validateAbbrev::
> +       If set to false (true by default) don't do any validation when
> +       printing abbreviated object names to see if they're really
> +       unique. This makes printing objects more performant at the
> +       cost of potentially printing object names that aren't unique
> +       within the current repository.

Good. I understand why I'd want to use it, and why not.

> ++
> +When printing abbreviated object names Git needs to look through the
> +local object store. This is an `O(log N)` operation assuming all the
> +objects are in a single pack file, but `X * O(log N)` given `X` pack
> +files, which can get expensive on some larger repositories.

This might be very close to too much information.
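
For anyone who wants the intuition behind the `X * O(log N)` figure, here
is a toy model in Python. It is only a sketch under loud assumptions: each
"pack" is a plain sorted list of lowercase hex object IDs, every ID lives
in exactly one pack, and the function names are made up for illustration.
It is not how Git's pack index lookup is actually implemented.

```python
from bisect import bisect_left


def count_matches(sorted_ids, prefix):
    """Count IDs in one sorted list that start with `prefix`.

    Two binary searches per list, so O(log N) per "pack".
    """
    lo = bisect_left(sorted_ids, prefix)
    # "g" sorts after every lowercase hex digit, so this finds the end
    # of the run of IDs sharing the prefix.
    hi = bisect_left(sorted_ids, prefix + "g")
    return hi - lo


def shortest_unique(packs, full_id, min_len=4):
    """Return the shortest prefix of `full_id` matching at most one ID.

    Every candidate length costs one lookup in each of the X packs,
    which is where the X * O(log N) cost comes from in this model.
    """
    for length in range(min_len, len(full_id) + 1):
        prefix = full_id[:length]
        if sum(count_matches(pack, prefix) for pack in packs) <= 1:
            return prefix
    return full_id


# Hypothetical data: two "packs" with made-up 8-character IDs.
packs = [["abcd1234", "abcf0000"], ["abcd9999"]]
print(shortest_unique(packs, "abcd1234"))
```

With validation turned off, the loop over packs goes away entirely and the
prefix at the configured length is printed as-is.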

> ++
> +This setting changes that to `O(1)`, but with the trade-off that
> +depending on the value of `core.abbrev` we may be printing abbreviated
> +hashes that collide. Too see how likely this is, try running:
> ++
> +-----------------------------------------------------------------------------------------------------------
> +git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr
> +-----------------------------------------------------------------------------------------------------------
> ++
> +This shows how many commits were found at each abbreviation length. On
> +linux.git in June 2018 this shows a bit more than 750,000 commits,
> +with just 4 needing 11 characters to be fully abbreviated, and the
> +default heuristic picks a length of 12.

These last few paragraphs seem like too much to me.

> ++
> +Even without `core.validateAbbrev=false` the results abbreviation
> +already a bit of a probability game. They're guaranteed at the moment
> +of generation, but as more objects are added, ambiguities may be
> +introduced. Likewise, what's unambiguous for you may not be for
> +somebody else you're communicating with, if they have their own clone.

This seems more useful.
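
To put rough numbers on the "probability game", a birthday-style estimate
is enough. The sketch below assumes object IDs are uniformly distributed
hex strings, which is an idealization, and the function names are
hypothetical. The object counts are illustrative and not measurements of
any real repository.

```python
import math


def expected_colliding_pairs(num_objects, abbrev_len):
    """Expected number of object pairs sharing the same abbreviation.

    Assumes IDs are uniform over 16**abbrev_len possible prefixes.
    """
    space = 16 ** abbrev_len
    return num_objects * (num_objects - 1) / (2 * space)


def collision_probability(num_objects, abbrev_len):
    """Approximate chance that at least one abbreviation is ambiguous.

    Uses the usual approximation 1 - exp(-expected_pairs).
    """
    return -math.expm1(-expected_colliding_pairs(num_objects, abbrev_len))


# Illustrative only: how the estimate changes as the abbreviation grows.
for length in (8, 10, 12, 14):
    p = collision_probability(5_000_000, length)
    print(f"{length} hex digits: {p:.6f}")
```

Each extra hex digit divides the expected number of colliding pairs by 16,
which is the reasoning behind suggesting a relative `core.abbrev` such as
`+2` when validation is skipped.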

> ++
> +Therefore the default of `core.validateAbbrev=true` may not save you
> +in practice if you're sharing the SHA-1 or noting it now to use after
> +a `git fetch`. You may be better off setting `core.abbrev` to
> +e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
> +that with `core.validateAbbrev=false` to get a reasonable trade-off
> +between safety and performance.

Makes sense. As before, I'd suggest s/SHA-1/object ID/.

I do wonder if this documentation could be a bit less verbose without
sacrificing too much correctness and clarity. No brilliant suggestions
on how to do that, sorry.

Martin