Thanks everyone for the reviews. The only code change is to invoke the repacking of non-promisor objects into a promisor pack only when the first object that needs repacking is detected. If not, there will be an empty pack created, which is the cause of the failing tests Junio noticed. Other than that, there are some code comments and commit message changes as requested by reviews. I've checked that all tests pass before and after merging with "seen". Jonathan Tan (3): index-pack --promisor: dedup before checking links index-pack --promisor: don't check blobs index-pack --promisor: also check commits' trees builtin/index-pack.c | 101 +++++++++++++++++++++++++++++-------------- 1 file changed, 69 insertions(+), 32 deletions(-) Range-diff against v1: 1: 5f0f114dbd ! 1: 7ae21c921f index-pack: dedup first during outgoing link check @@ Metadata Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx> ## Commit message ## - index-pack: dedup first during outgoing link check + index-pack --promisor: dedup before checking links Commit c08589efdc (index-pack: repack local links into promisor packs, 2024-11-01) fixed a bug with what was believed to be a negligible @@ builtin/index-pack.c: static void repack_local_links(void) + if (!oidset_size(&outgoing_links)) return; - base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository)); -@@ builtin/index-pack.c: static void repack_local_links(void) - if (start_command(&cmd)) - die(_("could not start pack-objects to repack local links")); - -- oidset_iter_init(&local_links, &iter); +- base_name = mkpathdup("%s/pack/pack", repo_get_object_directory(the_repository)); + oidset_iter_init(&outgoing_links, &iter); - while ((oid = oidset_iter_next(&iter))) { ++ while ((oid = oidset_iter_next(&iter))) { + struct object_info info = OBJECT_INFO_INIT; + if (oid_object_info_extended(the_repository, oid, &info, 0)) + /* Missing; assume it is a promisor object */ + continue; + if (info.whence == OI_PACKED && info.u.packed.pack->pack_promisor) + continue; + +- strvec_push(&cmd.args, "pack-objects"); +- strvec_push(&cmd.args, "--exclude-promisor-objects-best-effort"); +- strvec_push(&cmd.args, base_name); +- cmd.git_cmd = 1; +- cmd.in = -1; +- cmd.out = -1; +- if (start_command(&cmd)) +- die(_("could not start pack-objects to repack local links")); ++ if (!cmd.args.nr) { ++ base_name = mkpathdup( ++ "%s/pack/pack", ++ repo_get_object_directory(the_repository)); ++ strvec_push(&cmd.args, "pack-objects"); ++ strvec_push(&cmd.args, ++ "--exclude-promisor-objects-best-effort"); ++ strvec_push(&cmd.args, base_name); ++ cmd.git_cmd = 1; ++ cmd.in = -1; ++ cmd.out = -1; ++ if (start_command(&cmd)) ++ die(_("could not start pack-objects to repack local links")); ++ } + +- oidset_iter_init(&local_links, &iter); +- while ((oid = oidset_iter_next(&iter))) { if (write_in_full(cmd.in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 || write_in_full(cmd.in, "\n", 1) < 0) die(_("failed to feed local object to pack-objects")); + } ++ ++ if (!cmd.args.nr) ++ return; ++ + close(cmd.in); + + out = xfdopen(cmd.out, "r"); @@ builtin/index-pack.c: int cmd_index_pack(int argc, } else if (skip_to_optional_arg(arg, "--keep", &keep_msg)) { ; /* nothing to do */ 2: 300f53b8e3 ! 2: 5a63c9a5ca index-pack: no blobs during outgoing link check @@ Metadata Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx> ## Commit message ## - index-pack: no blobs during outgoing link check + index-pack --promisor: don't check blobs As a follow-up to the parent of this commit, it was found that not checking for the existence of blobs linked from trees sped up the fetch from 24m47.815s to 2m2.127s. Teach Git to do that. - The benefit of doing this is as above (fetch speedup), but the drawback - is that if the packfile to be indexed references a local blob directly - (that is, not through a local tree), that local blob is in danger of - being garbage collected. Such a situation may arise if we push local - commits, including one with a change to a blob in the root tree, - and then the server incorporates them into its main branch through a - "rebase" or "squash" merge strategy, and then we fetch the new main - branch from the server. - - This situation has not been observed yet - we have only noticed missing - commits, not missing trees or blobs. (In fact, if it were believed that - only missing commits are problematic, one could argue that we should - also exclude trees during the outgoing link check; but it is safer to - include them.) - - Due to the rarity of the situation (it has not been observed to happen - in real life), and because the "penalty" in such a situation is merely - to refetch the missing blob when it's needed, the tradeoff seems - worth it. + The tradeoff of not checking blobs is documented in a code comment. (Blobs may also be linked from tag objects, but it is impossible to know the type of an object linked from a tag object without looking it up in @@ Commit message Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx> ## builtin/index-pack.c ## +@@ builtin/index-pack.c: static void record_outgoing_link(const struct object_id *oid) + oidset_insert(&outgoing_links, oid); + } + ++static void maybe_record_name_entry(const struct name_entry *entry) ++{ ++ /* ++ * The benefit of doing this is as above (fetch speedup), but the drawback ++is that if the packfile to be indexed references a local blob directly ++(that is, not through a local tree), that local blob is in danger of ++being garbage collected. Such a situation may arise if we push local ++commits, including one with a change to a blob in the root tree, ++and then the server incorporates them into its main branch through a ++"rebase" or "squash" merge strategy, and then we fetch the new main ++branch from the server. ++ ++This situation has not been observed yet - we have only noticed missing ++commits, not missing trees or blobs. (In fact, if it were believed that ++only missing commits are problematic, one could argue that we should ++also exclude trees during the outgoing link check; but it is safer to ++include them.) ++ ++Due to the rarity of the situation (it has not been observed to happen ++in real life), and because the "penalty" in such a situation is merely ++to refetch the missing blob when it's needed, the tradeoff seems ++worth it. ++ */ ++ if (S_ISDIR(entry->mode)) ++ record_outgoing_link(&entry->oid); ++} ++ + static void do_record_outgoing_links(struct object *obj) + { + if (obj->type == OBJ_TREE) { @@ builtin/index-pack.c: static void do_record_outgoing_links(struct object *obj) - * verified, so do not print any here. */ return; -- while (tree_entry_gently(&desc, &entry)) + while (tree_entry_gently(&desc, &entry)) - record_outgoing_link(&entry.oid); -+ while (tree_entry_gently(&desc, &entry)) { -+ if (S_ISDIR(entry.mode)) -+ record_outgoing_link(&entry.oid); -+ } ++ maybe_record_name_entry(&entry); } else if (obj->type == OBJ_COMMIT) { struct commit *commit = (struct commit *) obj; struct commit_list *parents = commit->parents; 3: 2f2f0db78b ! 3: 8139325bf2 index-pack: commit tree during outgoing link check @@ Metadata Author: Jonathan Tan <jonathantanmy@xxxxxxxxxx> ## Commit message ## - index-pack: commit tree during outgoing link check + index-pack --promisor: also check commits' trees Commit c08589efdc (index-pack: repack local links into promisor packs, 2024-11-01) seems to contain an oversight in that the tree of a commit - is not checked. The fix slows down a fetch from a certain repo at - $DAYJOB from 2m2.127s to 2m45.052s, but in order to make the fetch - correct, it seems worth it. + is not checked. Teach git to check these trees. + + The fix slows down a fetch from a certain repo at $DAYJOB from 2m2.127s + to 2m45.052s, but in order to make the fetch correct, it seems worth it. In order to test this, we could create server and client repos as follows... @@ Commit message (O and C are commits both on the client and server. S is a commit only on the server. C and S have the same tree but different commit - messages.) + messages. The diff between O and C is non-zero.) ...and then, from the client, fetch S from the server. -- 2.47.0.338.g60cca15819-goog