On 8/11/20 4:52 PM, Taylor Blau wrote: > To allow using the existing bitmap code with 64-bit words, we write the > data in network byte order from the 64-bit words. This means we also > need to read the array from the commit-graph file by translating each > word from network byte order using get_be64() when loading the commit > graph. (Note that this *could* be delayed until first-use, but a later > patch will rely on this being initialized early, so we assume the > up-front cost when parsing instead of delaying initialization). I don't think this is correct to do. This means that every commit-graph load will load the entire large bloom filter chunk into memory before allowing a single commit to be read from the graph. In the case of a very large commit-graph (> 1 million commits), this would cause a significant initial cost that is not necessarily needed for anything. For example, "git log -1" will be delayed by this action. If I understand correctly, this bloom_large bitmap is only used when writing Bloom filters. At that point, reading the entire bitmap from disk into memory is inexpensive compared to the time saved by the feature. > @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r, > graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8); > } > break; > + > + case GRAPH_CHUNKID_BLOOMLARGE: > + if (graph->chunk_bloom_large_filters) > + chunk_repeated = 1; > + else if (r->settings.commit_graph_read_changed_paths) { This is guarded against commitGraph.readChangedPaths, which is good, but that's not enough to claim that we need the bloom_large bitmap in this process. > + size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t); If we store this inside the commit_graph struct, we can save this size for later so... > + graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t); > + graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset); ...this portion can be done only when we are about to read from the bitmap. > + if (alloc) { > + size_t j; > + graph->bloom_large = bitmap_word_alloc(alloc); > + > + for (j = 0; j < graph->bloom_large->word_alloc; j++) > + graph->bloom_large->words[j] = get_be64( > + graph->chunk_bloom_large_filters + j * sizeof(eword_t)); > + } > + } > + break; > } > > if (chunk_repeated) { > @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r, > /* We need both the bloom chunks to exist together. Else ignore the data */ > graph->chunk_bloom_indexes = NULL; > graph->chunk_bloom_data = NULL; > + graph->chunk_bloom_large_filters = NULL; > FREE_AND_NULL(graph->bloom_filter_settings); > + bitmap_free(graph->bloom_large); Perhaps this bitmap_free() needs a check to see if we initialized it (with my recommended lazy-loading), but maybe not? > } > > hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len); > @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit > return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c); > } > > +static int get_bloom_filter_large_in_graph(struct commit_graph *g, > + const struct commit *c) > +{ > + uint32_t graph_pos = commit_graph_position(c); > + if (graph_pos == COMMIT_NOT_FROM_GRAPH) > + return 0; > + > + while (g && graph_pos < g->num_commits_in_base) > + g = g->base_graph; > + > + if (!(g && g->bloom_large)) > + return 0; Here is where we can check if the size of the chunk is non-zero and if the g->bloom_large bitmap is uninitialized. Since we are caring about this information, it is now time to do the network-byte transition of the full bitmap. > + return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base); > +} > Thanks, -Stolee