Re: [RFC] Bump {diff,merge}.renameLimit ?

Felipe Contreras <felipe.contreras@xxxxxxxxx> · Mon, 12 Jul 2021 15:58:53 -0500

Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@xxxxxxxx> wrote:

> > > * I think the median file size is a better predictor of rename
> > >   performance than mean file size, and median file size is ~2.5x smaller
> > >   than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/deleted/renamed? Who
> > knows).
> 
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files.  In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed file sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the file sizes in the v2.6.25 tree.
> 
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
> 
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides a
> pretty good first-order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
> 
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
> 
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:

That is almost always the case (except in unskewed distributions where
the mean is equal to the median).

Another option instead of an opaque configuration like 'renamelimit'
--which is almost entirely arbitrary for most users--would be to have
'renamelevel'. A renamelevel of 5 would be the median, so that's already
more meaningful than any value of renamelimit.

A renamelevel of 9 would be the equivalent of the 9th decile, so that
would catch 90% of renames.

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.
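
To make the Pareto point concrete, here is a rough sketch (Python, not a
patch) of how a level could be mapped to a limit. It assumes the quantity
being limited follows a Pareto distribution with scale x_min and shape
alpha; both parameters, and the function names, are hypothetical and
would have to be fitted to real repositories. The decile formula is just
the Pareto quantile function, x_p = x_min * (1 - p)^(-1/alpha).

```python
def pareto_quantile(p, x_min, alpha):
    """Value below which a fraction p of a Pareto(x_min, alpha) population falls.

    x_min (scale) and alpha (shape) are assumed inputs; nothing here is
    derived from measured git data.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if x_min <= 0 or alpha <= 0:
        raise ValueError("x_min and alpha must be positive")
    # Inverse of the Pareto CDF F(x) = 1 - (x_min / x) ** alpha.
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def rename_limit_for_level(level, x_min, alpha):
    """Map a hypothetical 'renamelevel' (1..9) to a limit at that decile."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return round(pareto_quantile(level / 10, x_min, alpha))


if __name__ == "__main__":
    # Illustrative parameters only.
    for level in range(1, 10):
        print(level, rename_limit_for_level(level, x_min=100, alpha=1.0))
```

With alpha = 1, level 5 (the median) comes out at twice x_min.
Hard-coding a table of nine values would avoid needing the formula at
all.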

-- 
Felipe Contreras