On 6/10/2021 1:22 PM, Taylor Blau wrote: > On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote: ... >> Reading the code there seems to be no way to do that, and we have the >> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings". >> >> I'd expect some way to combine the "max_new_filters" and --split with >> some eventual-consistency logic so that graphs not matching our current >> settings are replaced, or replaced some <limit> at a time. > > This is asking about something slightly different, Bloom filter > settings rather than the existence of chagned-path Bloom filters > themselves. The Bloom settings aren't written to the commit-graph > although there has been some discussion about doing this in the past. Some of the settings are included, but not the "maximum size" of a filter. Thus, if that maximum size changes we do not have a way to determine if a missing filter is larger than that limit or not. Further, the existing filters might be larger than the new maximum which means we would need to check if some of those filters should be dropped. Here is the spec of the BDAT chunk: Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional] * It starts with header consisting of three unsigned 32-bit integers: - Version of the hash algorithm being used. We currently only support value 1 which corresponds to the 32-bit version of the murmur3 hash implemented exactly as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double hashing technique using seed values 0x293ae76f and 0x7e646e2 as described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters in Probabilistic Verification" - The number of times a path is hashed and hence the number of bit positions that cumulatively determine whether a file is present in the commit. - The minimum number of bits 'b' per entry in the Bloom filter. If the filter contains 'n' entries, then the filter size is the minimum number of 64-bit words that contain n*b bits. * The rest of the chunk is the concatenation of all the computed Bloom filters for the commits in lexicographic order. > If we ever did encode the Bloom settings, I imagine that accomplishing a > sort of "eventually replace all changed-path Bloom filters with these > new settings" would be as simple as considering all filters computed > under different settings to be "uncomputed". > >> Also, am I reading the expire_commit_graphs() logic correctly that we >> first write the split graph, and then unlink() things that are too old? >> I.e. if you rely on the commit-graph to optimize things this will make >> things slower until the next run of writing the graph? > > Before expire_commit_graphs() is called, we call mark_commit_graphs() > which freshens the mtimes of all surviving commit-graph layers, and then > expire_commit_graphs() removes the stale layers. I'm not sure what > things getting slower is referring to since the resulting commit-graph > has at least as many commits as the commit-graph that existed prior to > the write. > >> I expected to find something more gentle there [...] > > FWIW, I also find this "expire based on mtimes" thing a little odd for > writing split commit-graphs because we know exactly which layers we want > to get rid of. I suspect that the reuse comes from wanting to unify the > logic for handling '--expire-time' with the expiration that happens > after writing a split commit-graph that merged two or more previous > layers. > > I would probably change mark_commit_graphs() to remove those merged > layers explicitly (but still run expire_commit_graphs() to handle > --expire-time). But, come to think of it... if merging >2 layers already > causes the merged layers to be removed, then why would you ever set an > --expire-time yourself? The --expire-time is intended to leave the old layers around a while so concurrent processes who already parsed the commit-graph-chain file can discover the layers it referenced. It's not a perfect mechanism, so there is room for improvement here. Thanks, -Stolee