Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Derrick Stolee <derrickstolee@xxxxxxxxxx> · Thu, 3 Aug 2023 09:18:11 -0400

On 8/2/2023 8:01 PM, Taylor Blau wrote:
> On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:

>> That's a good point. I think in general I'd expect Git to avoid
>> recomputing Bloom filters where that work can be avoided, if the work in
>> order to detect whether or not we need to recompute a filter is cheap
>> enough to carry out.
> 
> I spent some time implementing this (patches are available in the branch
> 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
> filter versions is kind of tricky, but do-able without having to
> implement too much on top of what's already there.
> 
> But I don't think that it's enough to say that we can reuse a commit's
> Bloom filter iff that commit's tree has no paths with characters >=
> 0x80. Suppose we have such a tree, whose Bloom filter we believe to be
> reusable. If its first parent *does* have such a path, then that path
> would appear as a deletion relative to its first parent. So that path
> *would* be in the filter, meaning that it isn't reusable.
> 
> So I think the revised condition is something like: a commit's Bloom
> filter is reusable when there are no paths with characters >= 0x80 in
> a tree-diff against its first parent. I think that ensuring that there
> are no such paths in both a commit's root tree, as well as its first
> parent's root tree is equivalent, since that removes the possibility of
> such a path showing up in its tree-diff.

Checking this condition amounts to "compute the diff to learn which paths
were input to the filter," which costs as much as recomputing the Bloom
filter from scratch. I don't see much room for a performance improvement
here.
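
To make that concrete, here is a rough sketch in Python (not Git's actual
code; `has_high_byte`, `filter_is_reusable` and the sample paths are
hypothetical names made up for illustration). The byte check itself is
trivial, but its input is the list of changed paths, and producing that
list is the tree-diff we were hoping to skip:

```python
from typing import Iterable


def has_high_byte(path: bytes) -> bool:
    """Return True if any byte of the path is >= 0x80."""
    return any(b >= 0x80 for b in path)


def filter_is_reusable(changed_paths: Iterable[bytes]) -> bool:
    """Hypothetical reuse check for an existing changed-path filter.

    `changed_paths` stands in for the result of the tree-diff against
    the first parent. That diff is the expensive part, and it is an
    assumption of this sketch that the caller has already computed it.
    """
    return not any(has_high_byte(p) for p in changed_paths)


# A non-ASCII path that exists only in the first parent still appears
# in the diff (as a deletion), so it still blocks reuse.
diff_ascii_only = [b"Makefile", b"src/bloom.c"]
diff_with_deletion = [b"Makefile", "docs/caf\u00e9.txt".encode("utf-8")]

print(filter_is_reusable(diff_ascii_only))     # True
print(filter_is_reusable(diff_with_deletion))  # False
```

Once `changed_paths` is in hand, hashing those paths into a fresh filter
is the cheap remainder of the work, so the check saves very little.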

Thanks,
-Stolee