Re: [PATCH 9/9] builtin/cat-file: use bitmaps to efficiently filter by object type

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Patrick Steinhardt <ps@xxxxxx> writes:

> While it is now possible to filter objects by type, this mechanism is
> for now mostly a convenience. Most importantly, we still have to iterate
> through the whole packfile to find all objects of a specific type. This
> can be prohibitively expensive depending on the size of the packfiles.
>
> It isn't really possible to do better than this when only considering a
> packfile itself, as the order of objects is not fixed. But when we have
> a packfile with a corresponding bitmap, either because the packfile
> itself has one or because the multi-pack index has a bitmap for it, then
> we can use these bitmaps to improve the runtime.
>
> While bitmaps are typically used to compute reachability of objects,
> they also contain one bitmap per object type encodes which object has

perhaps s/type encodes/type that encodes/

> what type. So instead of reading through the whole packfile(s), we can
> use the bitmaps and iterate through the type-specific bitmap. Typically,
> only a subset of packfiles will have a bitmap. But this isn't really
> much of a problem: we can use bitmaps when available, and then use the
> non-bitmap walk for every packfile that isn't covered by one.
>
> Overall, this leads to quite a significant speedup depending on how many
> objects of a certain type exist. The following benchmarks have been
> executed in the Chromium repository, which has a 50GB packfile with
> almost 25 million objects:
>
>     Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
>       Time (mean ± σ):     82.806 s ±  6.363 s    [User: 30.956 s, System: 8.264 s]
>       Range (min … max):   73.936 s … 89.690 s    10 runs
>
>     Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
>       Time (mean ± σ):      20.8 ms ±   1.3 ms    [User: 6.1 ms, System: 14.5 ms]
>       Range (min … max):    18.2 ms …  23.6 ms    127 runs
>
>     Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
>       Time (mean ± σ):      1.551 s ±  0.008 s    [User: 1.401 s, System: 0.147 s]
>       Range (min … max):    1.541 s …  1.566 s    10 runs
>
>     Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
>       Time (mean ± σ):     11.169 s ±  0.046 s    [User: 10.076 s, System: 1.063 s]
>       Range (min … max):   11.114 s … 11.245 s    10 runs
>
>     Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
>       Time (mean ± σ):     67.342 s ±  3.368 s    [User: 20.318 s, System: 7.787 s]
>       Range (min … max):   62.836 s … 73.618 s    10 runs
>
>     Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
>       Time (mean ± σ):     13.032 s ±  0.072 s    [User: 11.638 s, System: 1.368 s]
>       Range (min … max):   12.960 s … 13.199 s    10 runs
>
>     Summary
>       git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
>        74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
>       538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
>       627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
>      3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
>      3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
>
> The first benchmark is mostly equivalent in runtime compared to all the
> others without the bitmap-optimization introduced in this commit. What
> is noticeable in the benchmarks is that we're I/O-bound, not CPU-bound,
> as can be seen from the user/system runtimes, which is often way lower
> than the overall benchmarked runtime.
>
> Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
> ---
>  builtin/cat-file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 50 insertions(+), 5 deletions(-)
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 25d5429e391..9021fd52f30 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -21,6 +21,7 @@
>  #include "streaming.h"
>  #include "oid-array.h"
>  #include "packfile.h"
> +#include "pack-bitmap.h"
>  #include "object-file.h"
>  #include "object-name.h"
>  #include "object-store-ll.h"
> @@ -805,7 +806,20 @@ static int batch_one_object_packed(const struct object_id *oid,
>  				 payload->payload);
>  }
>
> -static void batch_each_object(for_each_object_fn callback,
> +static int batch_one_object_bitmapped(const struct object_id *oid,
> +				      enum object_type type UNUSED,
> +				      int flags UNUSED,
> +				      uint32_t hash UNUSED,
> +				      struct packed_git *pack,
> +				      off_t offset,
> +				      void *_payload)
> +{
> +	struct for_each_object_payload *payload = _payload;
> +	return payload->callback(oid, pack, offset, payload->payload);
> +}
> +
> +static void batch_each_object(struct batch_options *opt,
> +			      for_each_object_fn callback,
>  			      unsigned flags,
>  			      void *_payload)
>  {
> @@ -813,9 +827,40 @@ static void batch_each_object(for_each_object_fn callback,
>  		.callback = callback,
>  		.payload = _payload,
>  	};
> +	struct bitmap_index *bitmap = prepare_bitmap_git(the_repository);
> +
>  	for_each_loose_object(batch_one_object_loose, &payload, 0);
> -	for_each_packed_object(the_repository, batch_one_object_packed,
> -			       &payload, flags);
> +
> +	if (bitmap &&
> +	    (opt->objects_filter.choice == LOFC_OBJECT_TYPE ||
> +	     opt->objects_filter.choice == LOFC_BLOB_NONE)) {
> +		struct packed_git *pack;
> +
> +		if (opt->objects_filter.choice == LOFC_OBJECT_TYPE) {
> +			for_each_bitmapped_object(bitmap, opt->objects_filter.object_type,
> +						  batch_one_object_bitmapped, &payload);
> +		} else {

Nit: while this can be derived from the if statement above, it would be
more readable if this was `if else (opt->objects_filter.choice ==
LOFC_BLOB_NONE)`

> +			for_each_bitmapped_object(bitmap, OBJ_COMMIT,
> +						  batch_one_object_bitmapped, &payload);
> +			for_each_bitmapped_object(bitmap, OBJ_TAG,
> +						  batch_one_object_bitmapped, &payload);
> +			for_each_bitmapped_object(bitmap, OBJ_TREE,
> +						  batch_one_object_bitmapped, &payload);
> +		}
> +
> +		for (pack = get_all_packs(the_repository); pack; pack = pack->next) {
> +			if (bitmap_index_contains_pack(bitmap, pack) ||
> +			    open_pack_index(pack))
> +				continue;
> +			for_each_object_in_pack(pack, batch_one_object_packed,
> +						&payload, flags);
> +		}
> +	} else {
> +		for_each_packed_object(the_repository, batch_one_object_packed,
> +				       &payload, flags);
> +	}
> +
> +	free_bitmap_index(bitmap);
>  }
>
>  static int batch_objects(struct batch_options *opt)
> @@ -872,14 +917,14 @@ static int batch_objects(struct batch_options *opt)
>
>  			cb.seen = &seen;
>
> -			batch_each_object(batch_unordered_object,
> +			batch_each_object(opt, batch_unordered_object,
>  					  FOR_EACH_OBJECT_PACK_ORDER, &cb);
>
>  			oidset_clear(&seen);
>  		} else {
>  			struct oid_array sa = OID_ARRAY_INIT;
>
> -			batch_each_object(collect_object, 0, &sa);
> +			batch_each_object(opt, collect_object, 0, &sa);
>  			oid_array_for_each_unique(&sa, batch_object_cb, &cb);
>
>  			oid_array_clear(&sa);
>
> --
> 2.48.1.683.gf705b3209c.dirty

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux