On Sun, Oct 7, 2018 at 1:07 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
> > ...
> > by general public and I do not have to explain the choice to the
> > general public ;-)
>
> One thing that is more important than "why not 00 but 17?" to answer
> is why a hardcoded number rather than a runtime random. It is for
> repeatability.

Let's talk about repeatability vs statistics for a second. ;-)

Say I am a user who really cares about keeping my loose object count
low for some reason, so I want to set gc.auto to a small number. Let's
say I go with 128.

At the low end of loose object counts the approximation yields high
relative errors. This is because of its granularity: gc implicitly
estimates the number of loose objects to be 0, 256, 512 (or more) when
there are 0, 1, 2 (or more) loose objects in objects/17.

Each object can be viewed as an unfair coin flip (with a chance of
1/256 it lands in objects/17), so the number of objects in objects/17
(and hence in any other objects/XX bin) follows the binomial
distribution.

If I have, say, about 157 loose objects (and gc.auto configured
anywhere in 1..255), then the probability of not gc'ing is 54%, as
that is the probability of having 0 objects in objects/17 by the
probability mass function of the binomial distribution, i.e.

  Pr(0 objects) = (157 choose 0) x (1/256)^0 x (255/256)^157 ~= 0.54

As the sampling is repeatable (we pick the same objects/17 every
time), I can run "gc --auto" multiple times and still sit on 157 loose
objects, despite wanting to have only 128, with a 54% chance. If we
rolled the 256-sided die every time to pick a different bin, we might
hit another bin and gc on the second or third "gc --auto", which would
be more precise on average. (A small script double-checking these
numbers is below my sign-off.)

By having repeatability we accept that these numbers can be far off
more often when small values are configured. I think that is the right
choice, as we probably do not care about the exactness of auto-gc for
small numbers; it is a performance thing anyway. Documenting it
properly might be a challenge, though.

The current wording of gc.auto seems to suggest that the number we
compute is exact, whereas we really only compute an expected value: we
pick one bin and multiply its fullness by the number of bins to
estimate the overall fullness (see mean = n*p in [1]).

I think a user would be far more interested in specifying an upper
bound, i.e. expressing something like "I will have at most $gc.auto
loose objects before gc kicks in" or "the likelihood of exceeding
$gc.auto loose objects by $this much is less than 5%", for which the
math would be more complicated, but easier to document in the words of
statistics. (The second sketch below plays with that.)

[1] https://en.wikipedia.org/wiki/Binomial_distribution

Stefan
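
P.S. For anyone who wants to double check the 54% above, here is a tiny
back-of-the-envelope script. It only does the binomial math from this
mail; it is not a copy of what builtin/gc.c does, and the helper names
are made up:

#!/usr/bin/env python3
# Back-of-the-envelope numbers for the fixed-bin sampling in "gc --auto".
from math import comb

BINS = 256          # loose objects are spread over objects/00 .. objects/ff
P = 1 / BINS        # chance that any one loose object lands in objects/17

def prob_bin_count(n_loose, k):
    """P(exactly k of n_loose objects end up in the sampled bin): binomial pmf."""
    return comb(n_loose, k) * P**k * (1 - P)**(n_loose - k)

n = 157
print(f"{n} loose objects:")
print(f"  P(bin is empty, so no gc) = {prob_bin_count(n, 0):.2f}")   # ~0.54
# The estimate gc effectively works with is k * 256, so with a small
# gc.auto the answer is either "0 loose objects" or ">= 256 loose objects".
for k in range(3):
    print(f"  P(bin holds {k}) = {prob_bin_count(n, k):.2f}  -> estimate {k * BINS}")

Running it prints roughly 0.54, 0.33 and 0.10 for 0, 1 and 2 objects in
the bin, i.e. the estimate jumps straight from 0 to 256 to 512.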
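
P.P.S. And a sketch of what the "upper bound" phrasing could mean in
practice: given a per-bin threshold t (gc triggers once the sampled bin
holds at least t objects), how many loose objects can pile up before
the chance of still not gc'ing drops below 5%? Again purely
illustrative, not a proposed implementation and not how gc.c spells it
today:

#!/usr/bin/env python3
# How many loose objects can accumulate before we are 95% sure that a
# per-bin threshold of t has triggered "gc --auto"?
from math import comb

BINS = 256
P = 1 / BINS

def prob_no_gc(n_loose, t):
    """P(sampled bin holds fewer than t objects) = P(Binomial(n, 1/256) < t)."""
    return sum(comb(n_loose, k) * P**k * (1 - P)**(n_loose - k) for k in range(t))

def bound_for(t, tolerance=0.05):
    """Smallest n where the chance of not triggering gc drops below `tolerance`."""
    n = t  # with fewer than t objects we can never trigger
    while prob_no_gc(n, t) >= tolerance:
        n += 1
    return n

for t in (1, 2, 27):  # t = 27 is roughly what the default gc.auto = 6700 maps to
    print(f"bin threshold {t:3}: 95% sure to gc by {bound_for(t):5} loose objects "
          f"(naive estimate at the threshold: {t * BINS})")

For t = 1 the naive estimate says 256, but you need around 766 loose
objects before you are 95% sure a gc has happened; documenting the knob
as such an upper bound would make that gap explicit.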