Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> When an object to be packed is noticed to be missing, prefetch all
> to-be-packed objects in one batch.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@xxxxxxxxxx>
> ---

Hmph, the resulting codeflow structure feels somewhat iffy.  Perhaps
I am not reading the code correctly, but

 * There is a loop that scans from 0..to_pack.nr_objects and calls
   check_object() for each and every one of them;

 * The called check_object(), when it notices that a missing and
   promised (i.e. to be lazily fetched) object is in the to_pack
   array, asks prefetch_to_pack() to scan from that point to the
   end of that array and grab all of them that are missing.

It would feel a lot cleaner, and make it easier to see what is going
on in the resulting code, if instead of the way the new "loop" was
added, a new loop were added _before_ the loop that calls
check_object() on all objects in the to_pack array, as a
pre-processing phase that runs when there is a promisor remote.

That is, after reverting all the changes this patch makes to
check_object(), add a new loop in get_object_details() that looks
more or less like so:

	QSORT(sorted_by_offset, to_pack.nr_objects, pack_offset_sort);

+	if (has_promisor_remote())
+		prefetch_to_pack(0);
+
	for (i = 0; i < to_pack.nr_objects; i++) {

Was the patch done this way because scanning the entire array twice
is expensive?  The optimization makes sense to me only if certain
conditions are met, like...

 - Most of the time there is no missing object due to promisor, even
   if has_promisor_remote() is true;

 - When there are missing objects due to promisor, pack_offset_sort
   will keep them near the end of the array; and

 - Given the oid, oid_object_info_extended() on it with
   OBJECT_INFO_FOR_PREFETCH is expensive.

Only when all these conditions are met would it avoid unnecessary
overhead, by scanning only a later part of the array, i.e. by
delaying the point in the array where prefetch_to_pack() starts
scanning.

Thanks.
> There have been recent discussions about using QUICK whenever we use
> SKIP_FETCH_OBJECT. I don't think it fully applies here, since here we
> fully expect the object to be present in the non-partial-clone case.
> Having said that, I wouldn't be opposed to adding QUICK and then, if the
> object read fails and if the repo is not a partial clone, to retry the
> object load (before setting the type to -1).
> ---
>  builtin/pack-objects.c | 36 ++++++++++++++++++++++++++++++++----
>  t/t5300-pack-object.sh | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index e09d140eed..ecef5cda44 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -35,6 +35,7 @@
>  #include "midx.h"
>  #include "trace2.h"
>  #include "shallow.h"
> +#include "promisor-remote.h"
>
>  #define IN_PACK(obj) oe_in_pack(&to_pack, obj)
>  #define SIZE(obj) oe_size(&to_pack, obj)
> @@ -1704,7 +1705,26 @@ static int can_reuse_delta(const struct object_id *base_oid,
>  	return 0;
>  }
>
> -static void check_object(struct object_entry *entry)
> +static void prefetch_to_pack(uint32_t object_index_start) {
> +	struct oid_array to_fetch = OID_ARRAY_INIT;
> +	uint32_t i;
> +
> +	for (i = object_index_start; i < to_pack.nr_objects; i++) {
> +		struct object_entry *entry = to_pack.objects + i;
> +
> +		if (!oid_object_info_extended(the_repository,
> +					      &entry->idx.oid,
> +					      NULL,
> +					      OBJECT_INFO_FOR_PREFETCH))
> +			continue;
> +		oid_array_append(&to_fetch, &entry->idx.oid);
> +	}
> +	promisor_remote_get_direct(the_repository,
> +				   to_fetch.oid, to_fetch.nr);
> +	oid_array_clear(&to_fetch);
> +}
> +
> +static void check_object(struct object_entry *entry, uint32_t object_index)
>  {
>  	unsigned long canonical_size;
>  	enum object_type type;
> @@ -1843,8 +1863,16 @@ static void check_object(struct object_entry *entry)
>  	}
>
>  	if (oid_object_info_extended(the_repository, &entry->idx.oid, &oi,
> -				     OBJECT_INFO_LOOKUP_REPLACE) < 0)
> -		type = -1;
> +				     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_LOOKUP_REPLACE) < 0) {
> +		if (has_promisor_remote()) {
> +			prefetch_to_pack(object_index);
> +			if (oid_object_info_extended(the_repository, &entry->idx.oid, &oi,
> +						     OBJECT_INFO_SKIP_FETCH_OBJECT | OBJECT_INFO_LOOKUP_REPLACE) < 0)
> +				type = -1;
> +		} else {
> +			type = -1;
> +		}
> +	}
>  	oe_set_type(entry, type);
>  	if (entry->type_valid) {
>  		SET_SIZE(entry, canonical_size);
> @@ -2065,7 +2093,7 @@ static void get_object_details(void)
>
>  	for (i = 0; i < to_pack.nr_objects; i++) {
>  		struct object_entry *entry = sorted_by_offset[i];
> -		check_object(entry);
> +		check_object(entry, i);
>  		if (entry->type_valid &&
>  		    oe_size_greater_than(&to_pack, entry, big_file_threshold))
>  			entry->no_try_delta = 1;
> diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
> index 746cdb626e..d553d0ca46 100755
> --- a/t/t5300-pack-object.sh
> +++ b/t/t5300-pack-object.sh
> @@ -497,4 +497,40 @@ test_expect_success 'make sure index-pack detects the SHA1 collision (large blob
>  	)
> '
>
> +test_expect_success 'prefetch objects' '
> +	rm -rf server client &&
> +
> +	git init server &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server protocol.version 2 &&
> +
> +	echo one >server/one &&
> +	git -C server add one &&
> +	git -C server commit -m one &&
> +	git -C server branch one_branch &&
> +
> +	echo two_a >server/two_a &&
> +	echo two_b >server/two_b &&
> +	git -C server add two_a two_b &&
> +	git -C server commit -m two &&
> +
> +	echo three >server/three &&
> +	git -C server add three &&
> +	git -C server commit -m three &&
> +	git -C server branch three_branch &&
> +
> +	# Clone, fetch "two" with blobs excluded, and re-push it. This requires
> +	# the client to have the blobs of "two" - verify that these are
> +	# prefetched in one batch.
> +	git clone --filter=blob:none --single-branch -b one_branch \
> +		"file://$(pwd)/server" client &&
> +	test_config -C client protocol.version 2 &&
> +	TWO=$(git -C server rev-parse three_branch^) &&
> +	git -C client fetch --filter=blob:none origin "$TWO" &&
> +	GIT_TRACE_PACKET=$(pwd)/trace git -C client push origin "$TWO":refs/heads/two_branch &&
> +	grep "git> done" trace >donelines &&
> +	test_line_count = 1 donelines
> +'
> +
> test_done