Re: What's so special about objects/17/ ?

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 8 Oct 2018 12:17:40 -0700

On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Junio C Hamano <gitster@xxxxxxxxx> writes:

> > ...
> > by general public and I do not have to explain the choice to the
> > general public ;-)
>
> One thing that is more important than "why not 00 but 17?" to answer
> is why a hardcoded number rather than a runtime random.  It is for
> repeatability.

Let's talk about repeatability vs statistics for a second. ;-)

If I am a user and I were really into optimizing my loose object count
for some reason, so I would want to choose a low number of
gc.auto. Let's say I go with 128.

At the low end of loose objects the approximation is yielding
some high relative errors. This is because of the granularity, i.e.
gc would implicitly estimate the loose objects to be 0 or 256 or 512, (or more)
if there is 0, 1, 2 (or more) loose objects in the objects/17.

As each object can be viewed following an unfair coin flip
(With a chance of 1/256 it is in objects/17), the distribution in
objects/17 (and hence any other objects/XX bin) follows the
Bernoulli distribution.

If I do have say about 157 loose objects (and having auto.gc
configured anywhere in 1..255), then the probability to not
gc is 54% (as that is the probability to have 0 objects in /17,
following probability mass function of the Bernoulli distribution,
(i.e. Pr(0 objects) = (157 over 0) x (1/256)^0 x (255/256)^157))

As it is repeatable (by picking the same /17 every time), I can run
"gc --auto" multiple times and still have 157 loose objects, despite
wanting to have only 128 loose objects at a 54% chance.

If we'd roll the 256 dice every time to pick a different bin,
then we might hit another bin and gc in the second or third
gc, which would be more precise on average.

By having repeatability we allow for these numbers to be far off
more often when configuring small numbers.

I think that is the right choice, as we probably do not care about the
exactness of auto-gc for small numbers, as it is a performance
thing anyway. Although documenting it properly might be a challenge.

The current wording of auto.gc seems to suggest that we are right
for the number as we compute it via the implying the expected value,
(i.e. we pick a bin and multiply the fullness of the bin by the number
of bins to estimate the whole fullness, see the mean=n p on [1])
I think a user would be far more interested in giving an upper bound,
i.e. expressing something like "I will have at most $auto.gc objects
before gc kicks in" or "The likelihood to exceed the $auto.gc number
of loose objects by $this much is less than 5%", for which the math
would be more complicated, but easier to document with the words of
statistics.

[1]  https://en.wikipedia.org/wiki/Binomial_distribution

Stefan