On Tue, Sep 01, 2020 at 04:35:46PM +0200, SZEDER Gábor wrote: > On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote: > > This is already undesirable, since it means that we are wasting > > considerable effort to discover that a commit with too many changed > > paths, only to throw that effort away (and then repeat the process the > > next time a roll-up is performed). > > Is this really the case? > > If a commit has a corresponding entry in the Bloom filters chunks, > then the commit-graph machinery does know the Bloom filter associated > with that commit. The size of that filter should not matter, i.e. > even if it is a zero-length filter, the commit-graph machinery should > know about it all the same. And as far as I can tell it does indeed, > because load_bloom_filter_from_graph() sets a non-NULL 'filter->data' > pointer even if 'filter->len' is zero, which get_bloom_filter() treats > as "we know about this", and returns early without (re)computing the > filter. Even the test 'Bloom generation does not recompute too-large > filters' added in this patch is expected to succeed, and my > superficial and makeshift testing seems to corroborate this; at least > I couldn't find a combination of commands and options that would > recompute any existing zero-length Bloom filters. > > Am I missing something? I'm not sure that this would work, or at least that it creates another pair of cases that we have to disambiguate. Here's what I'm thinking after reading what you wrote: - By this point in the series, we expect every commit in a commit-graph layer to have a corresponding entry in the 'BIDX' chunk (if one exists, obviously all of this is meaningless without specifying '--changed-paths'). - In a future commit (specifically, "builtin/commit-graph.c: introduce '--max-new-filters=<n>'"), this will no longer be the case. - So, by that point in the series, we would have two possible reasons why a commit did not show up in the BIDX. Either: * The commit has too many changed paths, in which case we don't want to write it down, or * We had already computed too many new filters, and so didn't have the budget to compute the filter for some commit(s). In the first of those two cases, we want to avoid recomputing the too-large filter. But, in the latter case, we *do* want to compute the filter from scratch, since we want to "backfill" commits with missing filters by letting them trickle in over time. I might be missing something, too, but I think that those two cases are just as indistinguishable as what's in the commit-graph today (without the 'BFXL' chunk). > > To address this, add a new chunk which encodes a bitmap where the ith > > bit is on iff the ith commit has zero or at least 512 changed paths. > > Likewise store the maximum number of changed paths we are willing to > > store in order to prepare for eventually making this value more easily > > customizable. > > I don't understand how storing this value would make it any easier to > customize. It doesn't, but it's also not meant to. The idea is that if the value ever were easy to customize, then we could keep track of which layers were written with which threshold. This would allow you to invent fancy rules like dropping filters when the threshold lowers, or recomputing ones that you wouldn't have otherwise when it goes up. > > Another approach would be to introduce a new BIDX chunk (say, one > > identified by 'BID2') which is identical to the existing BIDX chunk, > > except the most-significant bit of each offset is interpreted as "this > > filter is too big" iff looking at a BID2 chunk. This avoids having to > > write a bitmap, but forces older clients to rewrite their commit-graphs > > (as well as reduces the theoretical largest Bloom filters we couldl > > And it reduces the max possible size of the BDAT chunk, and thus the > max number of commits with Bloom filters as well. Fair point. > s/couldl/could/ Oops, thanks. > > write, and forces us to maintain the code necessary to translate BIDX > > chunks to BID2 ones). Separately from this patch, I implemented this > > alternate approach and did not find it to be advantageous. > > Let's take a step back to reconsider what should be stored in this > bitmap for a moment. Sure, setting a bit for each commit that doesn't > modify any paths or modifies too many makes it possible to repliably > identify commits that don't have Bloom filters yet. But isn't it a > bit roundabout way? I think it would be better to directly track > which commits don't have Bloom filters yet. IOW what you really want > is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set > the corresponding bit for each commit which has an entry in the BIDX > chunk but for which a Bloom filter hasn't been computed yet. > > - It's simpler and easier to explain (IMO). > > - This bitmap chunk can easily be made optional: if all Bloom > filters have been computed, then the bitmap will contain all > zeros. So why bother writing it, when we can save a bit of space > instead? I don't think that this is true. Omitting the chunk (because you have computed--or tried to compute--filters for every commit in the graph) isn't distinguishable from what exists today, so we are back at square one there. > - It avoids the unpleasentness of setting a bit in the _Large_ Bloom > Filters chunks for commits _not_ modifying any paths. I agree it's unpleasant, but I also don't think it's a show-stopper. > - Less incentive to spill implementation details to the format > specification (e.g. 512 modified paths). > > Now, let's take another step back: is such a bitmap really necessary? > We could write a single-byte Bloom filter with no bits set for commits > not modifying any paths, and a single-byte Bloom filter with all bits > set for commits modifying too many paths. This is compatible with the > specs and any existing implementation should do the right thing when > reading such filters, this would allow us to interpret zero-length > filters as "not computed yet", and if that bitmap chunk won't be > optional, then this would save space as long as less than 1/8 of > commits modify no or too many paths. Unfortunately, however, all > existing zero-length Bloom filters have to be recomputed. I think this is a semantic difference: either you store a bitmap, or a Bloom filter containing the same data. To me, I don't think there's a huge difference, since we're talking about 1 bit per commit. If we were really worried, we could store them as EWAH-compressed bitmaps, but I don't get a sense that such a concern exists. I do feel strongly about using a non-probabilistic data structure, though, since the point of this feature is to be able to make tighter guarentees about the runtime of 'git commit-graph write'. > > > Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx> > > --- > > .../technical/commit-graph-format.txt | 12 +++ > > bloom.h | 2 +- > > commit-graph.c | 96 ++++++++++++++++--- > > commit-graph.h | 4 + > > t/t4216-log-bloom.sh | 25 ++++- > > 5 files changed, 124 insertions(+), 15 deletions(-) > > > > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt > > index 440541045d..5f2d9ab4d7 100644 > > --- a/Documentation/technical/commit-graph-format.txt > > +++ b/Documentation/technical/commit-graph-format.txt > > @@ -123,6 +123,18 @@ CHUNK DATA: > > of length zero. > > * The BDAT chunk is present if and only if BIDX is present. > > > > + Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional] > > + * It starts with a 32-bit unsigned integer specifying the maximum number of > > + changed-paths that can be stored in a single Bloom filter. > > Should this value be the same in all elements of a commit-graph chain? > Note that in this case having different values won't affect revision > walks using modified path Bloom filters. I don't think it needs to be the same necessarily. > > + * It then contains a list of 64-bit words (the length of this list is > > + determined by the width of the chunk) which is a bitmap. The 'i'th bit is > > + set exactly when the 'i'th commit in the graph has a changed-path Bloom > > + filter with zero entries (either because the commit is empty, or because > > + it contains more than 512 changed paths). > > Please make clear the byte order of these 64 bit words in the specs as > well. > > Furthermore, that 512 path limit is an implementation detail, so it > would be better if it didn't leak into the specification of this new > chunk. Addressed both, thanks. Thanks, Taylor