On 9/10/2020 11:45 AM, Taylor Blau wrote: > On Wed, Sep 09, 2020 at 11:35:57PM -0400, Taylor Blau wrote: >> On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote: >>> - filter->data = NULL; >>> - filter->len = 0; >>> + filter->data = xmalloc(1); >>> + filter->data[0] = 0xFF; >>> + filter->len = 1; >>> >>> if (computed) >>> *computed |= BLOOM_TRUNC_LARGE; >> >> Oops, I missed the case that added by the previous patch where the >> number of diff entries is smaller than the limit, but the hashmap >> entries (after directories are added and such) crosses the threshold. >> >> Specifically, this patch doesn't write the 0xFF filter like it should. >> I'll send a different version of this patch tomorrow. > > This one should do the trick. Let's use it instead. > > --- >8 --- > > Subject: [PATCH] bloom: encode out-of-bounds filters as non-empty > > When a changed-path Bloom filter has either zero, or more than a > certain number (commonly 512) of entries, the commit-graph machinery > encodes it as "missing". More specifically, it sets the indices adjacent > in the BIDX chunk as equal to each other to indicate a "length 0" > filter; that is, that the filter occupies zero bytes on disk. > > This has heretofore been fine, since the commit-graph machinery has no > need to care about these filters with too few or too many changed paths. > Both cases act like no filter has been generated at all, and so there is > no need to store them. > > In a subsequent commit, however, the commit-graph machinery will learn > to only compute Bloom filters for some commits in the current > commit-graph layer. This is a change from the current implementation > which computes Bloom filters for all commits that are in the layer being > written. Critically for this patch, only computing some of the Bloom > filters means adding a third state for length 0 Bloom filters: zero > entries, too many entries, or "hasn't been computed". > > It will be important for that future patch to distinguish between "not > representable" (i.e., zero or too-many changed paths), and "hasn't been > computed". In particular, we don't want to waste time recomputing > filters that have already been computed. > > To that end, change how we store Bloom filters in the "computed but not > representable" category: > > - Bloom filters with no entries are stored as a single byte with all > bits low (i.e., all queries to that Bloom filter will return > "definitely not") > > - Bloom filters with too many entries are stored as a single byte with > all bits set high (i.e., all queries to that Bloom filter will > return "maybe"). > > These rules are sufficient to not incur a behavior change by changing > the on-disk representation of these two classes. Likewise, no > specification changes are necessary for the commit-graph format, either: > > - Filters that were previously empty will be recomputed and stored > according to the new rules, and > > - old clients reading filters generated by new clients will interpret > the filters correctly and be none the wiser to how they were > generated. > > Clients will invoke the Bloom machinery in more cases than before, but > this can be addressed by returning a NULL filter when all bits are set > high. This can be addressed in a future patch. > > Finally, note that this does increase the size of on-disk commit-graphs, > but far less than other proposals. In particular, this is generally more > efficient than storing a bitmap for which commits haven't computed their > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, > whereas storing explicit filters as above incurs a penalty of one byte > per too-large or too-small commit. > > In practice, these boundary commits likely occupy a small proportion of > the overall number of commits, and so the size penalty is likely smaller > than storing a bitmap for all commits. > > A test to exercise filters which contain too many changed path entries > will be introduced in a subsequent patch. > > Suggested-by: SZEDER Gábor <szeder.dev@xxxxxxxxx> > Suggested-by: Jakub Narębski <jnareb@xxxxxxxxx> > Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx> > --- > .../technical/commit-graph-format.txt | 2 +- > bloom.c | 16 ++++++++++++-- > bloom.h | 1 + > commit-graph.c | 5 +++++ > t/t0095-bloom.sh | 8 +++---- > t/t4216-log-bloom.sh | 21 +++++++++++++++++-- > 6 files changed, 44 insertions(+), 9 deletions(-) > > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt > index 6ddbceba15..6585f1948a 100644 > --- a/Documentation/technical/commit-graph-format.txt > +++ b/Documentation/technical/commit-graph-format.txt > @@ -125,7 +125,7 @@ CHUNK DATA: > * The rest of the chunk is the concatenation of all the computed Bloom > filters for the commits in lexicographic order. > * Note: Commits with no changes or more than 512 changes have Bloom filters > - of length zero. > + of length one, with either all bits set to zero or one respectively. > * The BDAT chunk is present if and only if BIDX is present. > > Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional] > diff --git a/bloom.c b/bloom.c > index db9fb82437..d24747a1d5 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data, > return strcmp(e1->path, e2->path); > } > > +static void init_truncated_large_filter(struct bloom_filter *filter) > +{ > + filter->data = xmalloc(1); > + filter->data[0] = 0xFF; > + filter->len = 1; > +} > + So this hunk is essentially the new bit in this version. > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh > index 232ba2c485..7e4ab1795f 100755 > --- a/t/t0095-bloom.sh > +++ b/t/t0095-bloom.sh > @@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' ' > git init && > git commit --allow-empty -m "c0" && > cat >expect <<-\EOF && > - Filter_Length:0 > - Filter_Data: > + Filter_Length:1 > + Filter_Data:00| > EOF > test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && > test_cmp expect actual > @@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' ' > git add bigDir && > git commit -m "commit with 513 changes" && > cat >expect <<-\EOF && > - Filter_Length:0 > - Filter_Data: > + Filter_Length:1 > + Filter_Data:ff| > EOF > test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && > test_cmp expect actual And these two tests now show the new behavior. Nice. This version matches my expectations. Thanks. -Stolee