On Wed, Sep 16, 2020 at 02:07:59PM -0400, Taylor Blau wrote: > When a changed-path Bloom filter has either zero, or more than a > certain number (commonly 512) of entries, the commit-graph machinery > encodes it as "missing". More specifically, it sets the indices adjacent > in the BIDX chunk as equal to each other to indicate a "length 0" > filter; that is, that the filter occupies zero bytes on disk. > > This has heretofore been fine, since the commit-graph machinery has no > need to care about these filters with too few or too many changed paths. > Both cases act like no filter has been generated at all, and so there is > no need to store them. > > In a subsequent commit, however, the commit-graph machinery will learn > to only compute Bloom filters for some commits in the current > commit-graph layer. This is a change from the current implementation > which computes Bloom filters for all commits that are in the layer being > written. Critically for this patch, only computing some of the Bloom > filters means adding a third state for length 0 Bloom filters: zero > entries, too many entries, or "hasn't been computed". > > It will be important for that future patch to distinguish between "not > representable" (i.e., zero or too-many changed paths), and "hasn't been > computed". In particular, we don't want to waste time recomputing > filters that have already been computed. > > To that end, change how we store Bloom filters in the "computed but not > representable" category: > > - Bloom filters with no entries are stored as a single byte with all > bits low (i.e., all queries to that Bloom filter will return > "definitely not") > > - Bloom filters with too many entries are stored as a single byte with > all bits set high (i.e., all queries to that Bloom filter will > return "maybe"). > > These rules are sufficient to not incur a behavior change by changing > the on-disk representation of these two classes. Likewise, no > specification changes are necessary for the commit-graph format, either: > > - Filters that were previously empty will be recomputed and stored > according to the new rules, and > > - old clients reading filters generated by new clients will interpret > the filters correctly and be none the wiser to how they were > generated. > > Clients will invoke the Bloom machinery in more cases than before, but > this can be addressed by returning a NULL filter when all bits are set > high. This can be addressed in a future patch. OTOH, clients will invoke the tree-diff machinery in fewer cases than before, because querying the Bloom filter of commits not modifying any files will now return "definitely not". > Finally, note that this does increase the size of on-disk commit-graphs, > but far less than other proposals. In particular, this is generally more > efficient than storing a bitmap for which commits haven't computed their > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, > whereas storing explicit filters as above incurs a penalty of one byte > per too-large or too-small commit. s/too-small/empty/ > In practice, these boundary commits likely occupy a small proportion of > the overall number of commits, and so the size penalty is likely smaller > than storing a bitmap for all commits. | Percentage of | commits modifying | 0 path | >= 512 paths ---------------+------------+---------------- android-base | 13.20% | 0.13% cmssw | 0.15% | 0.23% cpython | 3.07% | 0.01% elasticsearch | 0.70% | 1.00% gcc | 0.00% | 0.08% gecko-dev | 0.14% | 0.64% git | 0.11% | 0.02% glibc | 0.02% | 0.10% go | 0.00% | 0.07% homebrew-cask | 0.40% | 0.02% homebrew-core | 0.01% | 0.01% jdk | 0.26% | 5.64% linux | 0.01% | 0.51% llvm-project | 0.12% | 0.03% rails | 0.10% | 0.10% rust | 0.07% | 0.17% tensorflow | 0.09% | 1.02% webkit | 0.05% | 0.31% > A test to exercise filters which contain too many changed path entries > will be introduced in a subsequent patch. > diff --git a/bloom.h b/bloom.h > index c6d77e8393..70a8840896 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -93,6 +93,7 @@ enum bloom_filter_computed { > BLOOM_NOT_COMPUTED = (1 << 0), > BLOOM_COMPUTED = (1 << 1), > BLOOM_TRUNC_LARGE = (1 << 2), > + BLOOM_TRUNC_SMALL = (1 << 3), s/SMALL/EMPTY/ This "small" suffix in the constant, variable, and trace2 key names is misleading, because we only mean empty commits. > }; > > struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, > diff --git a/commit-graph.c b/commit-graph.c > index 1ca754f19c..bd4247bca5 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -974,6 +974,7 @@ struct write_commit_graph_context { > > int count_bloom_filter_computed; > int count_bloom_filter_not_computed; > + int count_bloom_filter_trunc_small; > int count_bloom_filter_trunc_large; > }; >