Patrick Steinhardt <ps@xxxxxx> writes: > While it is now possible to filter objects by type, this mechanism is > for now mostly a convenience. Most importantly, we still have to iterate > through the whole packfile to find all objects of a specific type. This > can be prohibitively expensive depending on the size of the packfiles. > > It isn't really possible to do better than this when only considering a > packfile itself, as the order of objects is not fixed. But when we have > a packfile with a corresponding bitmap, either because the packfile > itself has one or because the multi-pack index has a bitmap for it, then > we can use these bitmaps to improve the runtime. > > While bitmaps are typically used to compute reachability of objects, > they also contain one bitmap per object type encodes which object has perhaps s/type encodes/type that encodes/ > what type. So instead of reading through the whole packfile(s), we can > use the bitmaps and iterate through the type-specific bitmap. Typically, > only a subset of packfiles will have a bitmap. But this isn't really > much of a problem: we can use bitmaps when available, and then use the > non-bitmap walk for every packfile that isn't covered by one. > > Overall, this leads to quite a significant speedup depending on how many > objects of a certain type exist. The following benchmarks have been > executed in the Chromium repository, which has a 50GB packfile with > almost 25 million objects: > > Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter > Time (mean ± σ): 82.806 s ± 6.363 s [User: 30.956 s, System: 8.264 s] > Range (min … max): 73.936 s … 89.690 s 10 runs > > Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag > Time (mean ± σ): 20.8 ms ± 1.3 ms [User: 6.1 ms, System: 14.5 ms] > Range (min … max): 18.2 ms … 23.6 ms 127 runs > > Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit > Time (mean ± σ): 1.551 s ± 0.008 s [User: 1.401 s, System: 0.147 s] > Range (min … max): 1.541 s … 1.566 s 10 runs > > Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree > Time (mean ± σ): 11.169 s ± 0.046 s [User: 10.076 s, System: 1.063 s] > Range (min … max): 11.114 s … 11.245 s 10 runs > > Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob > Time (mean ± σ): 67.342 s ± 3.368 s [User: 20.318 s, System: 7.787 s] > Range (min … max): 62.836 s … 73.618 s 10 runs > > Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none > Time (mean ± σ): 13.032 s ± 0.072 s [User: 11.638 s, System: 1.368 s] > Range (min … max): 12.960 s … 13.199 s 10 runs > > Summary > git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag > 74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit > 538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree > 627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none > 3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob > 3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter > > The first benchmark is mostly equivalent in runtime compared to all the > others without the bitmap-optimization introduced in this commit. What > is noticeable in the benchmarks is that we're I/O-bound, not CPU-bound, > as can be seen from the user/system runtimes, which is often way lower > than the overall benchmarked runtime. > > Signed-off-by: Patrick Steinhardt <ps@xxxxxx> > --- > builtin/cat-file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 50 insertions(+), 5 deletions(-) > > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index 25d5429e391..9021fd52f30 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -21,6 +21,7 @@ > #include "streaming.h" > #include "oid-array.h" > #include "packfile.h" > +#include "pack-bitmap.h" > #include "object-file.h" > #include "object-name.h" > #include "object-store-ll.h" > @@ -805,7 +806,20 @@ static int batch_one_object_packed(const struct object_id *oid, > payload->payload); > } > > -static void batch_each_object(for_each_object_fn callback, > +static int batch_one_object_bitmapped(const struct object_id *oid, > + enum object_type type UNUSED, > + int flags UNUSED, > + uint32_t hash UNUSED, > + struct packed_git *pack, > + off_t offset, > + void *_payload) > +{ > + struct for_each_object_payload *payload = _payload; > + return payload->callback(oid, pack, offset, payload->payload); > +} > + > +static void batch_each_object(struct batch_options *opt, > + for_each_object_fn callback, > unsigned flags, > void *_payload) > { > @@ -813,9 +827,40 @@ static void batch_each_object(for_each_object_fn callback, > .callback = callback, > .payload = _payload, > }; > + struct bitmap_index *bitmap = prepare_bitmap_git(the_repository); > + > for_each_loose_object(batch_one_object_loose, &payload, 0); > - for_each_packed_object(the_repository, batch_one_object_packed, > - &payload, flags); > + > + if (bitmap && > + (opt->objects_filter.choice == LOFC_OBJECT_TYPE || > + opt->objects_filter.choice == LOFC_BLOB_NONE)) { > + struct packed_git *pack; > + > + if (opt->objects_filter.choice == LOFC_OBJECT_TYPE) { > + for_each_bitmapped_object(bitmap, opt->objects_filter.object_type, > + batch_one_object_bitmapped, &payload); > + } else { Nit: while this can be derived from the if statement above, it would be more readable if this was `if else (opt->objects_filter.choice == LOFC_BLOB_NONE)` > + for_each_bitmapped_object(bitmap, OBJ_COMMIT, > + batch_one_object_bitmapped, &payload); > + for_each_bitmapped_object(bitmap, OBJ_TAG, > + batch_one_object_bitmapped, &payload); > + for_each_bitmapped_object(bitmap, OBJ_TREE, > + batch_one_object_bitmapped, &payload); > + } > + > + for (pack = get_all_packs(the_repository); pack; pack = pack->next) { > + if (bitmap_index_contains_pack(bitmap, pack) || > + open_pack_index(pack)) > + continue; > + for_each_object_in_pack(pack, batch_one_object_packed, > + &payload, flags); > + } > + } else { > + for_each_packed_object(the_repository, batch_one_object_packed, > + &payload, flags); > + } > + > + free_bitmap_index(bitmap); > } > > static int batch_objects(struct batch_options *opt) > @@ -872,14 +917,14 @@ static int batch_objects(struct batch_options *opt) > > cb.seen = &seen; > > - batch_each_object(batch_unordered_object, > + batch_each_object(opt, batch_unordered_object, > FOR_EACH_OBJECT_PACK_ORDER, &cb); > > oidset_clear(&seen); > } else { > struct oid_array sa = OID_ARRAY_INIT; > > - batch_each_object(collect_object, 0, &sa); > + batch_each_object(opt, collect_object, 0, &sa); > oid_array_for_each_unique(&sa, batch_object_cb, &cb); > > oid_array_clear(&sa); > > -- > 2.48.1.683.gf705b3209c.dirty
Attachment:
signature.asc
Description: PGP signature