On Wed, Jul 26, 2023 at 01:39:14PM -0700, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes: > > > Jonathan Tan (4): > > gitformat-commit-graph: describe version 2 of BDAT > > t4216: test changed path filters with high bit paths > > repo-settings: introduce commitgraph.changedPathsVersion > > commit-graph: new filter ver. that fixes murmur3 > > > > Taylor Blau (3): > > t/helper/test-read-graph.c: extract `dump_graph_info()` > > bloom.h: make `load_bloom_filter_from_graph()` public > > t/helper/test-read-graph: implement `bloom-filters` mode > > After a week, nobody seems to have found anything worth saying about > these patches. Does the design, especially the migration story, now > look sensible to everybody? I was contemplating to mark the topic > for 'next' after reading them myself once more. Sorry for not getting to this sooner. I didn't notice anything during my review, but I think there may be a bug here. Suppose that we have an existing commit-graph with v1 Bloom filters. If we then try to rewrite that commit-graph using v2 Bloom filters, we *should* attempt to recompute the filter from scratch. But AFAICT, that isn't what happens. Here's my test setup: test_expect_success 'test' ' test_commit base && git repack -d && git -c commitGraph.changedPathVersion=1 commit-graph write --changed-paths && debug git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths ' if you attach a debugger to the second process, and break inside of get_or_compute_bloom_filter() when compute_if_not_present is set, you'll see that Git will pass along the existing *v1* Bloom filter, and then write its contents to the new commit-graph: (gdb) b get_or_compute_bloom_filter if compute_if_not_present Breakpoint 1 at 0x14340f: file bloom.c, line 260. (gdb) r Starting program: /home/ttaylorr/src/git/git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Breakpoint 1, get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:260 260 if (computed) (gdb) until 271 get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:271 271 load_bloom_filter_from_graph(r->objects->commit_graph, (gdb) p *filter $2 = {data = 0x0, len = 0} (gdb) n 275 if (filter->data && filter->len) (gdb) p *filter $3 = {data = 0x7ffff7fc24a8 "\210\210\322a\267\234\214s}\004J\265\313\201\241\032e\312\034", len = 2} If I'm parsing this all correctly, Git used the v1 filter corresponding to that commit, and did not recompute a new one. I think that this could lead to incorrect results if you use this to masquerade a v1 Bloom filter as a v2 one. Since they use different implementations (one correct, one not) of murmur3, that opens us up to false negatives, at which point all bets are off. So I think we want to be more careful about when we load the existing Bloom data or not. I think we probably want to load it in all cases, but only read it when we have compatible Bloom settings. That should stop us from exhibiting this kind of bug, but it also gives us a handle on existing Bloom data if we wanted to copy forward existing results where we can. If all of this tracks, I think that there is a gap in our test coverage that didn't catch this earlier. Thanks, Taylor