Taylor Blau <me@xxxxxxxxxxxx> writes: > On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote: > > Suddenly reading many (or most) of the repo's trees would be a similar > > surprise, right? > > That's a good point. I think in general I'd expect Git to avoid > recomputing Bloom filters where that work can be avoided, if the work in > order to detect whether or not we need to recompute a filter is cheap > enough to carry out. Makes sense. I just don't think that there is a cheap way to detect if a filter does not need to be recomputed (the closest way I think we have is something that will require reading a lot of trees in a repo). > > Also this would happen only if the server operator explicitly sets a > > changed path filter version. If they leave it as-is, commit graphs will > > still be written with the same version as the one on disk. > > I think that I could live with that if the default is to leave things > as-is. Ah, thanks. > I still think that it's worth it to have this functionality to propagate > Bloom filters forward should ship in Git, but we can work on that > outside of this series. Makes sense. > > Regarding consulting commit_graph->bloom_filter_settings->hash_version, > > the issue I ran into was that firstly, I didn't know what to do about > > commit_graph->base_graph (which also has its own bloom_filter_settings) > > and what to do if it had a contradictory hash_version. And even if > > we found a way to unify those, it is not true that every Bloom filter > > in memory is of that version, since we may have generated some that > > correspond to the version we're writing (not the version on disk). > > In particular, the Bloom filters we write come from a commit slab > > (bloom_filters in bloom.c) and in that slab, both Bloom filters from > > disk and Bloom filters that were generated in-process coexist. > > Would we ever want to have a commit-graph chain with mixed Bloom filter > versions? Probably not, but I wanted to be robust in case a third-party tool wrote a chain with mixed versions. > We avoid mixing generation number schemes across multiple layers of a > commit-graph chain. But I don't see any reason that we should or need to > have a similar restriction in place for the Bloom filter version. Both > are readable, as long as the user-provided configuration allows them to > be. Yes, that's true - there is no inherent reason why we can't mix them, unlike with generation numbers. > We just have to interpret them differently depending on what layer of > the graph (and therefore, what Bloom filter version they are) they come > from. > > Sorry for thinking aloud a little there. I think that this means that we > at minimum have to keep in context the commit-graph layer we found the > Bloom filter in so that we can tie that back to its Bloom filter > version. That might just mean that we have to tag each Bloom filter with > the version it was computed under, or maybe we already have the > commit-graph layer in context, in which case we shouldn't need an > additional field. > > My gut is telling me that we probably *do* need such a field, since we > don't often have a reference to the particular layer that we found a > Bloom filter in, just the tip of the commit-graph chain that it came > from. We'll need the additional field because we don't know which commit graph layer it comes from. In fact, we don't even know which *repo* the commit comes from, since the commit slab is global. (Moving the slab to being under a repo or under a commit graph layer would fix this.) But I think there still remains the question of whether we really need to support multiple versions in one Git invocation. > > I also thought of your other proposal of augmenting struct bloom_filter > > to also include the version. The issue I ran into there is if, for a > > given commit, there already exists a Bloom filter read from disk with > > the wrong version, what should we do? Overwrite it, or store both > > versions in memory? (We can't just immediately output the Bloom filter > > to disk and forget about the new version, only storing its size so that > > we can generate the BIDX, because in the current code, generation and > > writing to disk are separate. We could try to refactor it, but I didn't > > want to make such a large change to reduce the possibility of bugs.) > > Both storing the version number and storing an additional pointer for a > > second version would increase memory consumption too, even when > > supporting two versions isn't needed, but maybe this isn't a big deal. > > It's likely that I'm missing something here, but what is stopping us > from discarding the old Bloom filter as soon as we generate the new > one? We shouldn't need to load the old filter again out of the commit > slab, right? > > Thanks, > Taylor I did not look at the code closely enough except to see that there was a gap between generating the new-version Bloom filters and writing them to disk, and I was concerned that now, or in the future, there would be some code in that gap that reads the Bloom filter for a commit and expects the old-version Bloom filter there.