From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> Thomas reported [1] that a "git fetch" command was failing with an error saying "unexpected duplicate commit id". The root cause is that they had fetch.writeCommitGraph enabled which generates commit-graph chains, and this instance was merging two layers that both contained the same commit ID. [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@xxxxxxxxxxxxxxxxxxx/ The initial assumption is that Git would not write a commit ID into a commit-graph layer if it already exists in a lower commit-graph layer. Somehow, this specific case did get into that situation, leading to this error. While unexpected, this isn't actually invalid (as long as the two layers agree on the metadata for the commit). When we parse a commit that does not have a graph_pos in the commit_graph_data_slab, we use binary search in the commit-graph layers to find the commit and set graph_pos. That position is never used again in this case. However, when we parse a commit from the commit-graph file, we load its parents from the commit-graph and assign graph_pos at that point. If those parents were already parsed from the commit-graph, then nothing needs to be done. Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. Thus, this die() is too aggressive. The easiest thing to do would be to ignore the duplicates. If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. Thankfully, we can collapse the list to erase the duplicate commit pointers. This allows us to get the end result we want without extra memory costs and minimal CPU time. Since the root cause for producing commit-graph layers with these duplicate commits is currently unknown, it is difficult to create a test for this scenario. For now, we must rely on testing the example data graciously provided in [1]. My local test successfully merged layers, and 'git commit-graph verify' passed. Reported-by: Thomas Braun <thomas.braun@xxxxxxxxxxxxxxxxxxx> Helped-by: Taylor Blau <me@xxxxxxxxxxxx> Co-authored-by: Jeff King <peff@xxxxxxxx> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> --- commit-graph: ignore duplicates when merging layers This wasn't quite as simple as what Peff had posted, since we really don't want to keep duplicate commits around in the new merged layer. I still don't have a grasp on how this happened in the first place, but will keep looking. Thanks, -Stolee Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v2 Pull-Request: https://github.com/gitgitgadget/git/pull/747 Range-diff vs v1: 1: b3342403f1 ! 1: 85f4e578b8 commit-graph: ignore duplicates when merging layers @@ Commit message Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. - Thus, this die() is too aggignoring the duplicates. + Thus, this die() is too aggressive. The easiest thing to do would be to + ignore the duplicates. - This leads to some additional complication that we did no have before: - if we only ignore the duplicates, then we will produce a commit-graph + If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. - Begrudgingly, the best way to fix this is to copy the commit pointers - into a new list that only contains de-duplicated commit IDs. This adds a - slight memory overhead, but it is small compared to having all of these - commits parsed in memory, so it should be an acceptable cost for - avoiding these failures. + Thankfully, we can collapse the list to erase the duplicate commit + pointers. This allows us to get the end result we want without extra + memory costs and minimal CPU time. Since the root cause for producing commit-graph layers with these duplicate commits is currently unknown, it is difficult to create a test @@ Commit message and 'git commit-graph verify' passed. Reported-by: Thomas Braun <thomas.braun@xxxxxxxxxxxxxxxxxxx> + Helped-by: Taylor Blau <me@xxxxxxxxxxxx> Co-authored-by: Jeff King <peff@xxxxxxxx> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> commit-graph.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index cb042bdba8..29bac78dc3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -2009,6 +2009,7 @@ static int commit_compare(const void *_a, const void *_b) static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx) { uint32_t i; + struct packed_commit_list deduped_commits = { NULL, 0, 0 }; if (ctx->report_progress) ctx->progress = start_delayed_progress( @@ -2016,6 +2017,8 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx) ctx->commits.nr); QSORT(ctx->commits.list, ctx->commits.nr, commit_compare); + deduped_commits.alloc = ctx->commits.nr; + ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc); ctx->num_extra_edges = 0; for (i = 0; i < ctx->commits.nr; i++) { @@ -2023,17 +2026,30 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx) if (i && oideq(&ctx->commits.list[i - 1]->object.oid, &ctx->commits.list[i]->object.oid)) { - die(_("unexpected duplicate commit id %s"), - oid_to_hex(&ctx->commits.list[i]->object.oid)); + /* + * Silently ignore duplicates. These were likely + * created due to a commit appearing in multiple + * layers of the chain, which is unexpected but + * not invalid. We should make sure there is a + * unique copy in the new layer. + */ } else { unsigned int num_parents; + deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i]; + deduped_commits.nr++; + num_parents = commit_list_count(ctx->commits.list[i]->parents); if (num_parents > 2) ctx->num_extra_edges += num_parents - 1; } } + free(ctx->commits.list); + ctx->commits.list = deduped_commits.list; + ctx->commits.nr = deduped_commits.nr; + ctx->commits.alloc = deduped_commits.alloc; + stop_progress(&ctx->progress); } base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7 -- gitgitgadget