On 8/2/2023 8:01 PM, Taylor Blau wrote:
> On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:
>> That's a good point. I think in general I'd expect Git to avoid
>> recomputing Bloom filters where that work can be avoided, if the work in
>> order to detect whether or not we need to recompute a filter is cheap
>> enough to carry out.
>
> I spent some time implementing this (patches are available in the branch
> 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
> filter versions is kind of tricky, but do-able without having to
> implement too much on top of what's already there.
>
> But I don't think that it's enough to say that we can reuse a commit's
> Bloom filter iff that commit's tree has no paths with characters >=
> 0x80. Suppose we have such a tree, whose Bloom filter we believe to be
> reusable. If its first parent *does* have such a path, then that path
> would appear as a deletion relative to its first parent. So that path
> *would* be in the filter, meaning that it isn't reusable.
>
> So I think the revised condition is something like: a commit's Bloom
> filter is reusable when there are no paths with characters >= 0x80 in
> a tree-diff against its first parent. I think that ensuring that there
> are no such paths in both a commit's root tree, as well as its first
> parent's root tree is equivalent, since that removes the possibility of
> such a path showing up in its tree-diff.

This condition is exactly "we computed the diff to know which paths
were input to the filter" which is as difficult as recomputing the
Bloom filter from scratch. I don't think there is much room to gain
a performance improvement here.

Thanks,
-Stolee
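
For illustration only, here is a minimal Python sketch of the reuse
condition discussed above. It is not Git's implementation: trees are
modeled as hypothetical dicts mapping a path (bytes) to a blob id, the
helper names are made up, and the leading-directory entries that a real
changed-path filter also records are ignored. It shows the case where
the commit's own tree has no bytes >= 0x80, yet a deletion relative to
the first parent still puts such a path into the filter input. Note
that the check has to compute the diff in order to answer the question.

```python
def changed_paths(parent_tree, commit_tree):
    """Paths added, deleted, or modified relative to the first parent.

    Both arguments are hypothetical flat trees: {path_bytes: blob_id}.
    """
    all_paths = set(parent_tree) | set(commit_tree)
    return {p for p in all_paths if parent_tree.get(p) != commit_tree.get(p)}


def has_high_bit(path):
    """True if any byte of the path is >= 0x80."""
    return any(b >= 0x80 for b in path)


def filter_is_reusable(parent_tree, commit_tree):
    """Reusable only if no path in the tree-diff has a byte >= 0x80."""
    return not any(has_high_bit(p) for p in changed_paths(parent_tree, commit_tree))


# The first parent has a path with bytes >= 0x80; the commit deletes it.
parent = {b"README": "a1", b"caf\xc3\xa9.txt": "b2"}
commit = {b"README": "a1"}

# The commit's own tree looks "clean" ...
commit_tree_clean = not any(has_high_bit(p) for p in commit)
# ... but the deleted path is part of the diff, so the filter is not reusable.
reusable = filter_is_reusable(parent, commit)
```

Here `commit_tree_clean` is true while `reusable` is false, which is the
counterexample to the "commit's tree only" condition.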