On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Is there any way with an existing --split setup that introduces a
> --changed-paths to make the "add bloom filters to the graph" eventually
> consistent, or is some one-off --split=replace the only way to
> grandfather in such a feature?

I'm assuming what you mean is "can I introduce changed-path Bloom
filters into an existing split commit-graph with many layers without
having to recompute the whole thing at once?"

If so, then the answer is yes. Passing --changed-paths causes the
commit-graph machinery to compute missing Bloom filters for every commit
in the graph layer it is writing.

So, if you do something like:

    git commit-graph write --split --reachable --size-multiple=2 \
      --changed-paths

(--size-multiple=2 is the default, but I'm including it for clarity),
then you'll get changed-path Bloom filters for all commits in the new
layer, including any layers which may have been merged to create that
layer.

That all still respects `--max-new-filters`, with preference going to
commits with lower generation numbers before higher ones (unless
including commits from packs explicitly with --stdin-packs, in which
case preference is given in pack order; see
commit-graph.c:commit_pos_cmp() for details).

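To make the `--max-new-filters` cap concrete, here is a minimal sketch of
a wrapper that bounds how many new filters a single write may compute.
The helper name write_layer_with_filter_cap and the limit used in the
usage note are made up for illustration; only the git options themselves
come from git-commit-graph(1).

```shell
# Hypothetical helper (not part of Git): write a split commit-graph
# layer in repository "$1", computing at most "$2" new changed-path
# Bloom filters during this one write.
write_layer_with_filter_cap () {
	repo=$1
	cap=$2	# illustrative limit chosen by the caller
	git -C "$repo" commit-graph write --split --reachable \
		--changed-paths --max-new-filters="$cap"
}
```

Running `write_layer_with_filter_cap . 1000` inside a repository would
write (or merge) a layer while computing no more than 1000 new filters.
As I understand it, commits that miss the cap are simply left without a
filter until a later write includes them in the layer being written.
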
t4216 shows this for --split=replace, but you could just as easily
imagine a test like this:

    #!/bin/sh

    rm -fr repo
    git init repo
    cd repo

    commit () {
      >"$1"
      git add "$1"
      git commit -m "$1"
    }

    # no changed-path Bloom filter
    commit missing
    git commit-graph write --split --reachable --no-changed-paths
    missing="$(git rev-parse HEAD)"

    ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"

    # >= 2x the size of the previous layer, so they will be merged
    commit bloom1
    commit bloom2
    git commit-graph write --split --reachable --changed-paths

    # and the $missing commit has a Bloom filter
    ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"

(One caveat is that if you run that script unmodified, you'll get a
filter on both invocations of the test-tool: that's because it computes
filters on the fly if they are missing. You can change that by s/1/0 in
the call to get_or_compute_bloom_filter().)

> Reading the code there seems to be no way to do that, and we have the
> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>
> I'd expect some way to combine the "max_new_filters" and --split with
> some eventual-consistency logic so that graphs not matching our current
> settings are replaced, or replaced some <limit> at a time.

This is asking about something slightly different: Bloom filter
settings, rather than the existence of changed-path Bloom filters
themselves. The Bloom settings aren't written to the commit-graph,
although there has been some discussion about doing this in the past.

If we ever did encode the Bloom settings, I imagine that accomplishing a
sort of "eventually replace all changed-path Bloom filters with these
new settings" would be as simple as considering all filters computed
under different settings to be "uncomputed".

> Also, am I reading the expire_commit_graphs() logic correctly that we
> first write the split graph, and then unlink() things that are too old?
> I.e. if you rely on the commit-graph to optimize things this will make
> things slower until the next run of writing the graph?

Before expire_commit_graphs() is called, we call mark_commit_graphs(),
which freshens the mtimes of all surviving commit-graph layers, and then
expire_commit_graphs() removes the stale layers.

I'm not sure what "things getting slower" is referring to, since the
resulting commit-graph has at least as many commits as the commit-graph
that existed prior to the write.

> I expected to find something more gentle there [...]

FWIW, I also find this "expire based on mtimes" thing a little odd for
writing split commit-graphs, because we know exactly which layers we
want to get rid of. I suspect that the reuse comes from wanting to unify
the logic for handling '--expire-time' with the expiration that happens
after writing a split commit-graph that merged two or more previous
layers.

I would probably change mark_commit_graphs() to remove those merged
layers explicitly (but still run expire_commit_graphs() to handle
--expire-time). But, come to think of it... if merging >2 layers already
causes the merged layers to be removed, then why would you ever set an
--expire-time yourself?

Thanks,
Taylor