Re: [PATCH v2 10/13] bloom: encode out-of-bounds filters as non-empty

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Fri, 18 Sep 2020 00:13:02 +0200

On Wed, Sep 16, 2020 at 02:07:59PM -0400, Taylor Blau wrote:
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.

OTOH, clients will invoke the tree-diff machinery in fewer cases than
before, because querying the Bloom filter of commits not modifying any
files will now return "definitely not".

> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.

s/too-small/empty/

> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.

                 |      Percentage of
                 |    commits modifying
                 |   0 path   |  >= 512 paths
  ---------------+------------+----------------
  android-base   |   13.20%   |   0.13%
  cmssw          |    0.15%   |   0.23%
  cpython        |    3.07%   |   0.01%
  elasticsearch  |    0.70%   |   1.00%
  gcc            |    0.00%   |   0.08%
  gecko-dev      |    0.14%   |   0.64%
  git            |    0.11%   |   0.02%
  glibc          |    0.02%   |   0.10%
  go             |    0.00%   |   0.07%
  homebrew-cask  |    0.40%   |   0.02%
  homebrew-core  |    0.01%   |   0.01%
  jdk            |    0.26%   |   5.64%
  linux          |    0.01%   |   0.51%
  llvm-project   |    0.12%   |   0.03%
  rails          |    0.10%   |   0.10%
  rust           |    0.07%   |   0.17%
  tensorflow     |    0.09%   |   1.02%
  webkit         |    0.05%   |   0.31%

> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.

> diff --git a/bloom.h b/bloom.h
> index c6d77e8393..70a8840896 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -93,6 +93,7 @@ enum bloom_filter_computed {
>  	BLOOM_NOT_COMPUTED = (1 << 0),
>  	BLOOM_COMPUTED     = (1 << 1),
>  	BLOOM_TRUNC_LARGE  = (1 << 2),
> +	BLOOM_TRUNC_SMALL  = (1 << 3),

s/SMALL/EMPTY/

This "small" suffix in the constant, variable, and trace2 key names is
misleading, because we only mean empty commits.

>  };
>  
>  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
> diff --git a/commit-graph.c b/commit-graph.c
> index 1ca754f19c..bd4247bca5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -974,6 +974,7 @@ struct write_commit_graph_context {
>  
>  	int count_bloom_filter_computed;
>  	int count_bloom_filter_not_computed;
> +	int count_bloom_filter_trunc_small;
>  	int count_bloom_filter_trunc_large;
>  };
>