On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote: > On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote: > > Taylor Blau <me@xxxxxxxxxxxx> writes: > > > > The intention in the current patch set was to not load it at all when we > > > > have incompatible Bloom settings because it appeared quite troublesome > > > > to notate which Bloom filter in memory is of which version. If we want > > > > to copy forward existing results, we can change that, but I don't know > > > > whether it's worth doing that (and if we think it's worth doing, this > > > > should probably go in another patch set). > > > > > > Yeah, I think having Bloom filters accessible from a commit-graph > > > regardless of whether or not it matches our Bloom filter version is > > > prerequisite to being able to implement something like this. > > > > > > I feel like this is important enough to do in the same patch set, or the > > > same release to avoid surprising operators when their commit-graph write > > > suddenly recomputes all of its Bloom filters. > > > > Suddenly reading many (or most) of the repo's trees would be a similar > > surprise, right? > > That's a good point. I think in general I'd expect Git to avoid > recomputing Bloom filters where that work can be avoided, if the work in > order to detect whether or not we need to recompute a filter is cheap > enough to carry out. I spent some time implementing this (patches are available in the branch 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom filter versions is kind of tricky, but do-able without having to implement too much on top of what's already there. But I don't think that it's enough to say that we can reuse a commit's Bloom filter iff that commit's tree has no paths with characters >= 0x80. Suppose we have such a tree, whose Bloom filter we believe to be reusable. If its first parent *does* have such a path, then that path would appear as a deletion relative to its first parent. So that path *would* be in the filter, meaning that it isn't reusable. So I think the revised condition is something like: a commit's Bloom filter is reusable when there are no paths with characters >= 0x80 in a tree-diff against its first parent. I think that ensuring that there are no such paths in both a commit's root tree, as well as its first parent's root tree is equivalent, since that removes the possibility of such a path showing up in its tree-diff. As long as we aren't generating a commit-graph with --stdin-packs, then we process commits in generation order, so we will see parents before their children. I think we could reuse existing filters in that case, but the condition is slightly more complex than I originally thought. Thanks, Taylor