"Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > From: Garima Singh <garima.singh@xxxxxxxxxxxxx> > Overly long lines in the commit message. > Read previously computed bloom filters from the commit-graph file if possible > to avoid recomputing during commit-graph write. Hmmm. This fixes (somewhat) the problem that I have noticed in previous patch that the Bloom filter was computed at least twice, once for BIDX chunk, once fo BDAT chunk. I think the order should be: - use Bloom filter on slab, if present - fill it from commit graph, if saved there - if needed, compute it from scratch (expensive operation!) If I understand it correctly, it now does it... though possibly with unnecessary memory allocation if commit-graph file does not include Bloomm filters data, and (re)computing is not requested (see later). But I might be wrong here. > > Reading from the commit-graph is based on the format in which bloom filters are > written in the commit graph file. See method `fill_filter_from_graph` in bloom.c This description reads a bit strange; it looks like it states a truism (we read in the format we wrote). It think it should be rephrased in different way for better readability. > > For reading the bloom filter for commit at lexicographic position i: I think it would better read as: To read Bloom filter for a given commit with lexicographic position 'i' we need to: > 1. Read BIDX[i] which essentially gives us the starting index in BDAT for filter > of commit i+1 (called the next_index in the code) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | \- I think would be not needed with better var name I would also add that it gives the position [one past] the end of Bloom filter data for i-th commit. > > 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT for > filter of commit i (called the prev_index in the code) Minor nitpick: Full stops are missing. Why it is called prev_index and next_index, while it is either curr_index and next_index or prev_index and curr_index, or maybe even better beg_index and end_index? > For i = 0, prev_index will be 0. The first lexicographic commit's filter will > start at BDAT. I would state it For first commit, with i = 0, Bloom filter data starts at the beginning, just past the header in BDAT chunk. > > 3. The length of the filter will be next_index - prev_index, because BIDX[i] > gives the cumulative 8-byte words including the ith commit's filter. > > We toggle whether bloom filters should be recomputed based on the compute_if_null > flag. All right. > > Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > Signed-off-by: Garima Singh <garima.singh@xxxxxxxxxxxxx> > --- > bloom.c | 40 ++++++++++++++++++++++++++++++++++++++-- > bloom.h | 3 ++- > commit-graph.c | 6 +++--- > 3 files changed, 43 insertions(+), 6 deletions(-) > > diff --git a/bloom.c b/bloom.c > index 08328cc381..86b1005802 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -1,5 +1,7 @@ > #include "git-compat-util.h" > #include "bloom.h" > +#include "commit.h" > +#include "commit-slab.h" > #include "commit-graph.h" > #include "object-store.h" > #include "diff.h" > @@ -119,13 +121,35 @@ static void add_key_to_filter(struct bloom_key *key, > } > } > > +static void fill_filter_from_graph(struct commit_graph *g, > + struct bloom_filter *filter, > + struct commit *c) > +{ > + uint32_t lex_pos, prev_index, next_index; > + > + while (c->graph_pos < g->num_commits_in_base) > + g = g->base_graph; > + > + lex_pos = c->graph_pos - g->num_commits_in_base; This part shares common code with load_oid_from_graph(), only without some of error checking; perhaps it might be good to extract it into a separate helper function, e.g. `lex_index(&g, c->graph_pos)`. Minor nitpick about the consistency of function names: why load_oid_from_graph(), but fill_filter_from_graph(), and not load_filter_from_graph() / load_bloom_from_graph()? > + > + next_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos); > + if (lex_pos) I think using + if (lex_pos > 0) or + if (lex_pos >= 0) might be easier to reason about. > + prev_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1)); > + else > + prev_index = 0; > + > + filter->len = next_index - prev_index; The above command reads a bit strange: next - prev? not next - curr, or curr - prev? Wouldn't it be better to name it begin_index and end_index, or beg_index and end_index for brevity? > + filter->data = (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12); Please do not use magic constants; use instead something like: + filter->data = (uint64_t *)(g->chunk_bloom_data + + sizeof(uint64_t) * prev_index + + BLOOMDATA_CHUNK_HEADER_SIZE); Perhaps using `3*sizeof(unit32_t)` instead of magic value 12 would be enough; but having symbolic name for BDAT chunk header size is better, I think. > +} > + > void load_bloom_filters(void) > { > init_bloom_filter_slab(&bloom_filters); > } > > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c) > + struct commit *c, > + int compute_if_null) I'm not sure about `compute_if_null` name... > { > struct bloom_filter *filter; > struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; > @@ -134,6 +158,18 @@ struct bloom_filter *get_bloom_filter(struct repository *r, > const char *revs_argv[] = {NULL, "HEAD", NULL}; > > filter = bloom_filter_slab_at(&bloom_filters, c); Note that the documentation for `slab_at(slab, commit)` is documented in `commit-slab.h` as * This function locates the data associated with the given commit in * the indegree slab, and returns the pointer to it. The location to * store the data is allocated as necessary. ~~~~~~~~~~~~~~~~~~~~~~ Should we worry about this possibly unnecessary allocation (if there is no Bloom filter chunk in the commit-graph, and we are not recomputing it)? There is `slab_peek(slab_commit)` with the following properties: * This function is similar to indegree_at(), but it will return NULL * until a call to indegree_at() was made for the commit. > + > + if (!filter->data) { What does the Bloom filter for a commit with no changes looks like? What about for a commit with more than 512 changes? Do, in either of those cases, filter->len is 0? If yes, what about filter->data? > + load_commit_graph_info(r, c); > + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && r->objects->commit_graph->chunk_bloom_indexes) { Please wrap overly long lines (109 characters seems too long; the CodingGuidelines states: - We try to keep to at most 80 characters per line. + if (c->graph_pos != COMMIT_NOT_FROM_GRAPH && + r->objects->commit_graph->chunk_bloom_indexes) { You seem to assume here that in the chain of commit-graph files either all of them would have Bloom filters, or all of them would be missing Bloom filters. Isn't it however possible for only some of the commit-graph files in chain to include Bloom filter data chunks? In such case, the top commit-graph file may have BIDX chunk (bloom_indexes), but the commit-graph with the commit 'c' might not have it. Or the top commit-graph file may be missing BIDX chunk, so Git would recompute it even if the commit-graph file for commit 'c' includes it. If such situation is forbidden, how the restriction is managed? Note: in any case, this needs to be tested! > + fill_filter_from_graph(r->objects->commit_graph, filter, c); > + return filter; > + } > + } All right, if it is not in slab, we try to read if from the commit graph. Looks all right. > + > + if (filter->data || !compute_if_null) > + return filter; ^^^^^^^^ | \- one tab too many If we have found existing filter (on slab or in the commit-graph), or if we won't be recomputing it, return it. O.K. > + > init_revisions(&revs, NULL); > revs.diffopt.flags.recursive = 1; > > @@ -198,4 +234,4 @@ struct bloom_filter *get_bloom_filter(struct repository *r, > DIFF_QUEUE_CLEAR(&diff_queued_diff); > > return filter; > -} > \ No newline at end of file > +} Accidental change. > diff --git a/bloom.h b/bloom.h > index ba8ae70b67..101d689bbd 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -36,7 +36,8 @@ struct bloom_key { > void load_bloom_filters(void); > > struct bloom_filter *get_bloom_filter(struct repository *r, > - struct commit *c); > + struct commit *c, > + int compute_if_null); > All right, this is just update of the function signature. > void fill_bloom_key(const char *data, > int len, > diff --git a/commit-graph.c b/commit-graph.c > index def2ade166..0580ce75d5 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1032,7 +1032,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, > uint32_t cur_pos = 0; > > while (list < last) { > - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); > + struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); > cur_pos += filter->len; > hashwrite_be32(f, cur_pos); > list++; > @@ -1051,7 +1051,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, > hashwrite_be32(f, settings->bits_per_entry); > > while (first < last) { > - struct bloom_filter *filter = get_bloom_filter(ctx->r, *first); > + struct bloom_filter *filter = get_bloom_filter(ctx->r, *first, 0); > hashwrite(f, filter->data, filter->len * sizeof(uint64_t)); > first++; > } O.K., so those two do not compute Bloom filters, but they are called from write_commit_graph_file(), which in turn is called in write_commit_graph() *after* running compute_bloom_filters(). Looks good to me, then. > @@ -1218,7 +1218,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx) > > for (i = 0; i < ctx->commits.nr; i++) { > struct commit *c = ctx->commits.list[i]; > - struct bloom_filter *filter = get_bloom_filter(ctx->r, c); > + struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1); > ctx->total_bloom_filter_size += sizeof(uint64_t) * filter->len; > display_progress(progress, i + 1); > } All right, so compute_bloom_filters() ensures that it is actually computed (if needed). Looks good to me. Regards, -- Jakub Narębski