Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Taylor Blau <me@xxxxxxxxxxxx> · Wed, 26 Jul 2023 20:17:20 -0400

On Wed, Jul 26, 2023 at 01:39:14PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>
> > Jonathan Tan (4):
> >   gitformat-commit-graph: describe version 2 of BDAT
> >   t4216: test changed path filters with high bit paths
> >   repo-settings: introduce commitgraph.changedPathsVersion
> >   commit-graph: new filter ver. that fixes murmur3
> >
> > Taylor Blau (3):
> >   t/helper/test-read-graph.c: extract `dump_graph_info()`
> >   bloom.h: make `load_bloom_filter_from_graph()` public
> >   t/helper/test-read-graph: implement `bloom-filters` mode
>
> After a week, nobody seems to have found anything worth saying about
> these patches.  Does the design, especially the migration story, now
> look sensible to everybody?  I was contemplating to mark the topic
> for 'next' after reading them myself once more.

Sorry for not getting to this sooner. I didn't notice anything during my
review, but I think there may be a bug here.

Suppose that we have an existing commit-graph with v1 Bloom filters. If
we then try to rewrite that commit-graph using v2 Bloom filters, we
*should* attempt to recompute the filter from scratch. But AFAICT, that
isn't what happens.

Here's my test setup:

    test_expect_success 'test' '
      test_commit base &&
      git repack -d &&

      git -c commitGraph.changedPathVersion=1 commit-graph write --changed-paths &&
      debug git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    '

if you attach a debugger to the second process, and break inside of
get_or_compute_bloom_filter() when compute_if_not_present is set, you'll
see that Git will pass along the existing *v1* Bloom filter, and then
write its contents to the new commit-graph:

    (gdb) b get_or_compute_bloom_filter if compute_if_not_present
    Breakpoint 1 at 0x14340f: file bloom.c, line 260.
    (gdb) r
    Starting program: /home/ttaylorr/src/git/git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Breakpoint 1, get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0,
        compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:260
    260		if (computed)
    (gdb) until 271
    get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1,
        settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:271
    271				load_bloom_filter_from_graph(r->objects->commit_graph,
    (gdb) p *filter
    $2 = {data = 0x0, len = 0}
    (gdb) n
    275		if (filter->data && filter->len)
    (gdb) p *filter
    $3 = {data = 0x7ffff7fc24a8 "\210\210\322a\267\234\214s}\004J\265\313\201\241\032e\312\034", len = 2}

If I'm parsing this all correctly, Git used the v1 filter corresponding
to that commit, and did not recompute a new one.

I think that this could lead to incorrect results if you use this to
masquerade a v1 Bloom filter as a v2 one. Since they use different
implementations (one correct, one not) of murmur3, that opens us up to
false negatives, at which point all bets are off.

So I think we want to be more careful about when we load the existing
Bloom data or not. I think we probably want to load it in all cases, but
only read it when we have compatible Bloom settings. That should stop us
from exhibiting this kind of bug, but it also gives us a handle on
existing Bloom data if we wanted to copy forward existing results where
we can.

If all of this tracks, I think that there is a gap in our test coverage
that didn't catch this earlier.

Thanks,
Taylor