This patch series implements the corrected commit date offsets as generation number v2, along with other pre-requisites. Git uses topological levels in the commit-graph file for commit-graph traversal operations like 'git log --graph'. Unfortunately, using topological levels can result in a worse performance than without them when compared with committer date as a heuristics. For example, 'git merge-base v4.8 v4.9' on the Linux repository walks 635,579 commits using topological levels and walks 167,468 using committer date. Since 091f4cf3 (commit: don't use generation numbers if not needed, 2018-08-30), 'git merge-base' uses committer date heuristic unless there is a cutoff because of the performance hit. Thus, the need for generation number v2 was born. New generation number needed to provide good performance, increment updates, and backward compatibility. Due to an unfortunate problem [1], we also needed a way to distinguish between the old and new generation number without incrementing graph version. [1] https://public-inbox.org/git/87a7gdspo4.fsf@xxxxxxxxxxxxxxxxxxx/ Various candidates were examined (https://github.com/derrickstolee/gen-test, https://github.com/abhishekkumar2718/git/pull/1). The proposed generation number v2, Corrected Commit Date with Mononotically Increasing Offsets performed much worse than committer date (506,577 vs. 167,468 commits walked for 'git merge-base v4.8 v4.9') and was dropped. Using Generation Data chunk (GDAT) relieves the requirement of backward compatibility as we would continue to store topological levels in Commit Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation number v2. The Corrected Commit Date is defined as follows: For a commit C, let its corrected commit date be the maximum of the commit date of C and the corrected commit dates of its parents plus 1. Then corrected commit date offset is the difference between corrected commit date of C and commit date of C. As a special case, a root commit with the timestamp zero has corrected commit date of 1 to be able to distinguish it from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date). We will introduce an additional commit-graph chunk, Generation DATa (GDAT) chunk, and store corrected commit date offsets in GDAT chunk while storing topological levels in CDAT chunk. The old versions of Git would ignore GDAT chunk, using topological levels from CDAT chunk. In contrast, new versions of Git would use corrected commit dates, falling back to topological level if the generation data chunk is absent in the commit-graph file. While storing corrected commit date offsets saves us 4 bytes per commit (as compared with storing corrected commit dates directly), it's however possible for the offset to overflow the space allocated. To handle such cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores the corrected commit date. For overflowing offsets, we set MSB and store the position into the GDOV chunk, in a mechanism similar to the Extra Edges list chunk. For mixed generation number environment (for example new Git on the command line, old Git used by GUI client), we can encounter a mixed-chain commit-graph (a commit-graph chain where some of split commit-graph files have GDAT chunk and others do not). As backward compatibility is one of the goals, we can define the following behavior: While reading a mixed-chain commit-graph version, we fall back on topological levels as corrected commit dates and topological levels cannot be compared directly. When adding new layer to the split commit-graph file, and when merging some or all layers (replacing them in the latter case), the new layer will have GDAT chunk if and only if in the final result there would be no layer without GDAT chunk just below it. Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews. I look forward to everyone's reviews! Thanks * Abhishek ---------------------------------------------------------------------------- Improvements left for a future series: * Save commits with generation data overflow and extra edge commits instead of looping over all commits. cf. 858sbel67n.fsf@xxxxxxxxx * Verify both topological levels and corrected commit dates when present. cf. 85pn4tnk8u.fsf@xxxxxxxxx Changes in version 6: * Fixed typos in commit message for "commit-graph: implement corrected commit date". * Removed an unnecessary else-block in "commit-graph: implement corrected commit date". * Validate mixed generation chain correctly while writing in "commit-graph: use generation v2 only if the entire chain does". * Die if the GDAT chunk indicates data has overflown but there are is no generation data overflow chunk. Changes in version 5: * Explained a possible reason for no change in performance for "commit-graph: fix regression when computing bloom-filters" * Clarified about the addition of a new test for 11-digit octal implementations of ustar. * Fixed duplicate test names in "commit-graph: consolidate fill_commit_graph_info". * Swapped the order "commit-graph: return 64-bit generation number", "commit-graph: add a slab to store topological levels" to minimize lines changed. * Fixed the mismerge in "commit-graph: return 64-bit generation number" * Clarified the preparatory steps are for the larger goal of implementing generation number v2 in "commit-graph: return 64-bit generation number". * Moved the rename of "run_three_modes()" to "run_all_modes()" into a new patch "t6600-test-reach: generalize *_three_modes". * Explained and removed the checks for GENERATION_NUMBER_INFINITY that can never be true in "commit-graph: add a slab to store topological levels". * Fixed incorrect logic for verifying commit-graph in "commit-graph: implement corrected commit date". * Added minor improvements to commit message of "commit-graph: implement generation data chunk". * Added '--date ' option to test_commit() in 'test-lib-functions.sh' in "commit-graph: implement generation data chunk". * Improved coding style (also in tests) for "commit-graph: use generation v2 only if entire chain does". * Simplified test repository structure in "commit-graph: use generation v2 only if entire chain does" as only the number of commits in a split commit-graph layer are relevant. * Added a new test in "commit-graph: use generation v2 only if entire chain does" to check if the layers are merged correctly. * Explicitly mentioned commit "091f4cf3" in the commit-message of "commit-graph: use corrected commit dates in paint_down_to_common()". * Minor corrections to documentation in "doc: add corrected commit date info". * Minor corrections to coding style. Changes in version 4: * Added GDOV to handle overflows in generation data. * Added a test for writing tip graph for a generation number v2 graph chain in t5324-split-commit-graph.sh * Added a section on how mixed generation number chains are handled in Documentation/technical/commit-graph-format.txt * Reverted unimportant whitespace, style changes in commit-graph.c * Added header comments about the order of comparision for compare_commits_by_gen_then_commit_date in commit.h, compare_commits_by_gen in commit-graph.h * Elaborated on why t6404 fails with corrected commit date and must be run with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected commit dates in paint_down_to_common()" * Elaborated on write behavior for mixed generation number chains in the commit "commit-graph: use generation v2 only if entire chain does" * Added notes about adding the topo_level slab to struct write_commit_graph_context as well as struct commit_graph. * Clarified commit message for "commit-graph: consolidate fill_commit_graph_info" * Removed the claim "GDAT can store future generation numbers" because it hasn't been tested yet. Changes in version 3: * Reordered patches as discussed in 2 [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@xxxxxxxxx/]. * Split "implement corrected commit date" into two patches - one introducing the topo level slab and other implementing corrected commit dates. * Extended split-commit-graph tests to verify at the end of test. * Use topological levels as generation number if any of split commit-graph files do not have generation data chunk. Changes in version 2: * Add tests for generation data chunk. * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write generation data chunk. * Compare commits with corrected commit dates if present in paint_down_to_common(). * Update technical documentation. * Handle mixed generation commit chains. * Improve commit messages for "commit-graph: fix regression when computing bloom filter", "commit-graph: consolidate fill_commit_graph_info", * Revert unnecessary whitespace changes. * Split uint_32 -> timestamp_t change into a new commit. Abhishek Kumar (11): commit-graph: fix regression when computing Bloom filters revision: parse parent in indegree_walk_step() commit-graph: consolidate fill_commit_graph_info t6600-test-reach: generalize *_three_modes commit-graph: add a slab to store topological levels commit-graph: return 64-bit generation number commit-graph: implement corrected commit date commit-graph: implement generation data chunk commit-graph: use generation v2 only if entire chain does commit-reach: use corrected commit dates in paint_down_to_common() doc: add corrected commit date info .../technical/commit-graph-format.txt | 28 +- Documentation/technical/commit-graph.txt | 77 +++++- commit-graph.c | 251 ++++++++++++++---- commit-graph.h | 15 +- commit-reach.c | 38 +-- commit-reach.h | 2 +- commit.c | 4 +- commit.h | 5 +- revision.c | 13 +- t/README | 3 + t/helper/test-read-graph.c | 4 + t/t4216-log-bloom.sh | 4 +- t/t5000-tar-tree.sh | 24 +- t/t5318-commit-graph.sh | 79 +++++- t/t5324-split-commit-graph.sh | 193 +++++++++++++- t/t6404-recursive-merge.sh | 5 +- t/t6600-test-reach.sh | 68 ++--- t/test-lib-functions.sh | 6 + upload-pack.c | 2 +- 19 files changed, 667 insertions(+), 154 deletions(-) base-commit: 4151fdb1c76c1a190ac9241b67223efd19f3e478 Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v6 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v6 Pull-Request: https://github.com/gitgitgadget/git/pull/676 Range-diff vs v5: 1: c4e817abf7d ! 1: 4d8eb415578 commit-graph: fix regression when computing Bloom filters @@ Metadata ## Commit message ## commit-graph: fix regression when computing Bloom filters - Before computing Bloom fitlers, the commit-graph machinery uses + Before computing Bloom filters, the commit-graph machinery uses commit_gen_cmp to sort commits by generation order for improved diff performance. 3d11275505 (commit-graph: examine commits by generation number, 2020-03-30) claims that this sort can reduce the time spent to @@ Commit message 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY while writing. - Not all hope is lost, though: 'commit_graph_generation()' falls back to + Not all hope is lost, though: 'commit_gen_cmp()' falls back to comparing commits by their date when they have equal generation number, - and so since c49c82aa4c is purely a date comparision function. This + and so since c49c82aa4c is purely a date comparison function. This heuristic is good enough that we don't seem to loose appreciable - performance while computing Bloom filters. Applying this patch (compared - with v2.29.1) speeds up computing Bloom filters by around ~4 - seconds. + performance while computing Bloom filters. + + Applying this patch (compared with v2.30.0) speeds up computing Bloom + filters by factors ranging from 0.40% to 5.19% on various repositories [1]. So, avoid the useless 'commit_graph_generation()' while writing by instead accessing the slab directly. This returns the newly-computed generation numbers, and allows us to avoid the heuristic by directly comparing generation numbers. + [1]: https://lore.kernel.org/git/20210105094535.GN8396@xxxxxxxxxx/ + Signed-off-by: Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> ## commit-graph.c ## -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb) +@@ commit-graph.c: static struct commit_graph_data *commit_graph_data_at(const struct commit *c) + return data; + } + ++/* ++ * Should be used only while writing commit-graph as it compares ++ * generation value of commits by directly accessing commit-slab. ++ */ + static int commit_gen_cmp(const void *va, const void *vb) + { const struct commit *a = *(const struct commit **)va; const struct commit *b = *(const struct commit **)vb; 2: 7645e0bcef0 = 2: 05dcb862818 revision: parse parent in indegree_walk_step() 3: ca646912b2b = 3: dcb9891d819 commit-graph: consolidate fill_commit_graph_info 4: 591935075f1 = 4: 4fbdee7ac90 t6600-test-reach: generalize *_three_modes 5: baae7006764 = 5: fbd8feb5d8c commit-graph: add a slab to store topological levels 6: 26bd6f49100 = 6: 855ff662a44 commit-graph: return 64-bit generation number 7: 859c39eff52 ! 7: 8fbe7486405 commit-graph: implement corrected commit date @@ Commit message of GDAT chunk, which is a reduction of around 6% in the size of commit-graph file. - However, using offsets be problematic if one of commits is malformed but - valid and has committerdate of 0 Unix time, as the offset would be the - same as corrected commit date and thus require 64-bits to be stored - properly. + However, using offsets be problematic if a commit is malformed but valid + and has committer date of 0 Unix time, as the offset would be the same + as corrected commit date and thus require 64-bits to be stored properly. While Git does not write out offsets at this stage, Git stores the corrected commit dates in member generation of struct commit_graph_data. @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph break; - } else if (level > max_level) { - max_level = level; -+ } else { -+ if (level > max_level) -+ max_level = level; -+ -+ if (corrected_commit_date > max_corrected_commit_date) -+ max_corrected_commit_date = corrected_commit_date; } ++ ++ if (level > max_level) ++ max_level = level; ++ ++ if (corrected_commit_date > max_corrected_commit_date) ++ max_corrected_commit_date = corrected_commit_date; } + if (all_parents_computed) { @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx) if (max_level > GENERATION_NUMBER_V1_MAX - 1) max_level = GENERATION_NUMBER_V1_MAX - 1; 8: 8403c4d0257 ! 8: 6d0696ae216 commit-graph: implement generation data chunk @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c + offset = (timestamp_t)get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index); + + if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) { ++ if (!g->chunk_generation_data_overflow) ++ die(_("commit-graph requires overflow generation data but has none")); ++ + offset_pos = offset ^ CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW; + graph_data->generation = get_be64(g->chunk_generation_data_overflow + 8 * offset_pos); + } else 9: a3a70a1edd0 ! 9: fba0d7f3dfe commit-graph: use generation v2 only if entire chain does @@ commit-graph.c: static void split_graph_merge_strategy(struct write_commit_graph g = g->base_graph; } @@ commit-graph.c: int write_commit_graph(struct object_directory *odb, - struct commit_graph *g = ctx->r->objects->commit_graph; + } else + ctx->num_commit_graphs_after = 1; - while (g) { -+ g->read_generation_data = 1; - g->topo_levels = &topo_levels; - g = g->base_graph; - } ++ validate_mixed_generation_chain(ctx->r->objects->commit_graph); ++ + compute_generation_numbers(ctx); + + if (ctx->changed_paths) @@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) * also GENERATION_NUMBER_V1_MAX. Decrement to avoid extra logic * in the following condition. 10: 093101f908b = 10: ba1f2c5555f commit-reach: use corrected commit dates in paint_down_to_common() 11: 20299e57457 = 11: e571f03d8bd doc: add corrected commit date info -- gitgitgadget