Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@xxxxxxxx> wrote:
> > > * I think the median file size is a better predictor of rename
> > > performance than mean file size, and median file size is ~2.5x smaller
> > > than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/delete/renamed? Who
> > knows).
>
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files. In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed files sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the files sizes in the v2.6.25 tree.
>
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
>
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides
> pretty good first order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
>
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
>
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:

That is almost always the case (except in unskewed distributions, where
the mean is equal to the median).

Another option, instead of an opaque configuration like 'renamelimit'
(which is almost entirely arbitrary for most users), would be to have
'renamelevel'.

A renamelevel of 5 would be the median, so that's already more
meaningful than any value of renamelimit. A renamelevel of 9 would be
the equivalent of the 9th decile, so that would catch 90% of renames.

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.

-- 
Felipe Contreras
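
To make the decile idea concrete, here is a rough sketch in Python. It
uses the standard Pareto quantile function Q(p) = x_min * (1 - p)^(-1/alpha)
and maps a level N to the quantile p = N/10. Everything named here is an
assumption for illustration only: 'renamelevel' is just the proposed
name, and x_min and alpha are hypothetical parameters that would have to
be fitted to real data. None of this exists in git.

```python
def pareto_quantile(p, x_min, alpha):
    """Value below which a fraction p of a Pareto(x_min, alpha) population falls.

    Standard closed form: Q(p) = x_min * (1 - p) ** (-1 / alpha).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if x_min <= 0 or alpha <= 0:
        raise ValueError("x_min and alpha must be positive")
    return x_min * (1.0 - p) ** (-1.0 / alpha)


def rename_level_to_limit(level, x_min, alpha):
    """Map a hypothetical 'renamelevel' (1..9) to a numeric limit.

    Level N is read as the N-th decile, so level 5 is the median and
    level 9 covers 90% of the population. x_min and alpha are assumed,
    fitted parameters; they are not values taken from git.
    """
    if level not in range(1, 10):
        raise ValueError("level must be an integer from 1 to 9")
    return round(pareto_quantile(level / 10.0, x_min, alpha))


if __name__ == "__main__":
    # Made-up parameters, purely to show the shape of the mapping.
    for level in range(1, 10):
        print(level, rename_level_to_limit(level, x_min=100, alpha=1.0))
```

The alternative mentioned above, hard-coding the levels, would amount to
replacing rename_level_to_limit() with a nine-entry table.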