Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension

On 6/20/2022 8:33 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> 
> Bitmap lookup table extension can let git to parse only the necessary
> bitmaps without loading the previous bitmaps one by one.

Here is an attempt to reword this a bit:

  The bitmap lookup table extension was documented by an earlier
  change, but Git does not yet know how to parse that information.
  The extension allows parsing a smaller portion of the bitmap
  file in order to find bitmaps for specific commits.

 
> Teach git to read and use the bitmap lookup table extension.

Normally, I don't mind doing the read portion after the write
portion, but it would be nice to have them in the opposite
order so we can test writing the extension _and Git ignoring
the extension_ before implementing the parsing. As it stands,
most of the code in this patch is untested until patch 5.

General outline attempt:

1. Document the format.
2. Write the extension if the flag is given.
3. Add pack.writeBitmapLookupTable and add tests that write
   the lookup table (and do other bitmap reads on that data).
4. Read the lookup table. The tests from step 3 already cover
   acting upon the lookup table. (Perhaps we add a mode here
   that disables GIT_READ_COMMIT_TABLE since that is not used
   anywhere else.)
5. Performance tests.

> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +		    git_env_bool("GIT_READ_COMMIT_TABLE", 1)) {

This environment variable does not appear to be used or
documented anywhere. Do we really want to use it as a way
to disable reading the lookup table in general? Or would it be
better to have a GIT_TEST_* variable for disabling the read
during testing?
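
If the latter, I am thinking of something as simple as this (the
variable name is only for illustration):

	if (flags & BITMAP_OPT_LOOKUP_TABLE &&
	    git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {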

> +			uint32_t entry_count = ntohl(header->entry_count);
> +			uint32_t table_size =
> +				(entry_count * the_hash_algo->rawsz) /* oids */ +
> +				(entry_count * sizeof(uint32_t)) /* offsets */ +
> +				(entry_count * sizeof(uint32_t)) /* xor offsets */ +
> +				(sizeof(uint32_t)) /* flags */;

Here, uint32_t is probably fine, but maybe we should just use
size_t instead? Should we use st_mult() and st_add() everywhere?
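
For example, something like this (st_mult()/st_add4() keep the
overflow checks while letting us move to size_t):

	size_t table_size =
		st_add4(st_mult(entry_count, the_hash_algo->rawsz), /* oids */
			st_mult(entry_count, sizeof(uint32_t)), /* offsets */
			st_mult(entry_count, sizeof(uint32_t)), /* xor offsets */
			sizeof(uint32_t) /* flags */);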

Note: you're using the_hash_algo->rawsz here, which makes sense
because the bitmap format doesn't specify which hash algorithm is
used. Just making this note to say that we should include the hash
algorithm as a value in the bitmap format when we increment the
format version (in the future).

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit commit table)");
> +
> +			index->table_lookup = (void *)(index_end - table_size);
> +			index->table_offsets = index->table_lookup + the_hash_algo->rawsz * entry_count;

st_mult(), st_add()? Or, should we assume safety now?

> +			index_end -= table_size;
> +		}

> -	if (load_bitmap_entries_v1(bitmap_git) < 0)
> +	if (!bitmap_git->table_lookup && load_bitmap_entries_v1(bitmap_git) < 0)
>  		goto failed;

Ok, don't load these entries pre-emptively if we have the lookup table.

> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint);

I see that we have a two-method recursion loop. Please move this
declaration to immediately before lazy_bitmap_for_commit() so it
is declared as late as possible.

> +static inline const unsigned char *bitmap_oid_pos(struct bitmap_index *bitmap_git,
> +						  uint32_t pos)
> +{
> +	return bitmap_git->table_lookup + (pos * the_hash_algo->rawsz);
> +}

I would call this "bitmap_hash_pos()" because we are getting a raw
hash and not a 'struct object_id'. Do you want a helper that fills
a 'struct object_id', perhaps passed-by-reference?
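
Something like this, perhaps (untested, and the names are only
suggestions):

	static inline const unsigned char *bitmap_hash_pos(struct bitmap_index *bitmap_git,
							   uint32_t pos)
	{
		return bitmap_git->table_lookup + st_mult(pos, the_hash_algo->rawsz);
	}

	static inline void bitmap_oid_pos(struct bitmap_index *bitmap_git,
					  uint32_t pos, struct object_id *oid)
	{
		oidread(oid, bitmap_hash_pos(bitmap_git, pos));
	}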

> +static inline const void *bitmap_offset_pos(struct bitmap_index *bitmap_git,
> +					    uint32_t pos)

> +static inline const void *xor_position_pos(struct bitmap_index *bitmap_git,
> +					   uint32_t pos)

These two helpers should probably return a size_t and a uint32_t
instead of a pointer, and do the get_be[32|64]() on the computed
pointer themselves.
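
Roughly this (untested, and guessing that the xor table sits
immediately after the offset table at table_offsets):

	static inline size_t bitmap_offset_pos(struct bitmap_index *bitmap_git,
					       uint32_t pos)
	{
		return get_be32(bitmap_git->table_offsets +
				st_mult(pos, sizeof(uint32_t)));
	}

	static inline uint32_t xor_position_pos(struct bitmap_index *bitmap_git,
						uint32_t pos)
	{
		return get_be32(bitmap_git->table_offsets +
				st_mult(st_add(bitmap_git->entry_count, pos),
					sizeof(uint32_t)));
	}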

> +static int bitmap_table_lookup(struct bitmap_index *bitmap_git,
> +			       struct object_id *oid,
> +			       uint32_t *commit_pos)
> +{
> +	unsigned char *found = bsearch(oid->hash, bitmap_git->table_lookup,
> +				       bitmap_git->entry_count,
> +				       the_hash_algo->rawsz, bitmap_lookup_cmp);
> +	if (found)
> +		*commit_pos = (found - bitmap_git->table_lookup) / the_hash_algo->rawsz;
> +	return !!found;

Ok, we are running binary search and converting the pointer into a
position.

Frequently, these kinds of searches return an int, but use a negative
value to indicate that the value was not found. Using an int in this
way would restrict us to 2^31 bitmaps instead of 2^32, so maybe it
is not worth matching that practice.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						    struct object_id *oid,
> +						    uint32_t commit_pos)
> +{
> +	uint32_t xor_pos;
> +	off_t bitmap_ofs;
> +
> +	int flags;
> +	struct ewah_bitmap *bitmap;
> +	struct stored_bitmap *xor_bitmap;
> +
> +	bitmap_ofs = get_be32(bitmap_offset_pos(bitmap_git, commit_pos));
> +	xor_pos = get_be32(xor_position_pos(bitmap_git, commit_pos));

These lines become simpler with a change in the helper methods'
prototypes, as I recommended higher up.
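
That is, simply:

	bitmap_ofs = bitmap_offset_pos(bitmap_git, commit_pos);
	xor_pos = xor_position_pos(bitmap_git, commit_pos);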

> +	/*
> +	 * Lazily load the xor'd bitmap if required (and we haven't done so
> +	 * already). Make sure to pass the xor'd bitmap's position along as a
> +	 * hint to avoid an unnecessary binary search in
> +	 * stored_bitmap_for_commit().
> +	 */
> +	if (xor_pos == 0xffffffff) {
> +		xor_bitmap = NULL;
> +	} else {
> +		struct commit *xor_commit;
> +		struct object_id xor_oid;
> +
> +		oidread(&xor_oid, bitmap_oid_pos(bitmap_git, xor_pos));
> +
> +		xor_commit = lookup_commit(the_repository, &xor_oid);
> +		if (!xor_commit)
> +			return NULL;
> +
> +		xor_bitmap = stored_bitmap_for_commit(bitmap_git, xor_commit,
> +						      &xor_pos);
> +	}

This is an interesting sort of mutual recursion between the two
methods. We might be better off using a loop with a stack: push the
commit positions of the XOR bitmaps onto the stack. At the very
bottom, we get a bitmap without an XOR base. Then, pop off the stack,
modifying the bitmap with XOR operations as we go. (Perhaps we also
store these bitmaps in-memory along the way?) Finally, we have the
necessary bitmap.

This iterative approach avoids possible stack exhaustion if there
are long XOR chains in the file.
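
A rough sketch of the loop I have in mind (completely untested,
reusing the helpers as they appear in this patch; it also stops early
if an intermediate bitmap is already in the hashmap):

	uint32_t pos = commit_pos;
	uint32_t *stack = NULL;
	size_t nr = 0, alloc = 0;
	struct stored_bitmap *base = NULL;

	/* Walk down the xor chain, remembering each position. */
	while (pos != 0xffffffff) {
		struct object_id cur;
		khiter_t hash_pos;

		oidread(&cur, bitmap_oid_pos(bitmap_git, pos));
		hash_pos = kh_get_oid_map(bitmap_git->bitmaps, cur);
		if (hash_pos < kh_end(bitmap_git->bitmaps)) {
			/* Already parsed; use it as the xor base. */
			base = kh_value(bitmap_git->bitmaps, hash_pos);
			break;
		}
		ALLOC_GROW(stack, nr + 1, alloc);
		stack[nr++] = pos;
		pos = get_be32(xor_position_pos(bitmap_git, pos));
	}

	/* Pop back up, storing each bitmap as it is resolved. */
	while (nr--) {
		struct object_id cur;
		struct ewah_bitmap *bitmap;
		int flags;
		off_t ofs = get_be32(bitmap_offset_pos(bitmap_git, stack[nr]));

		oidread(&cur, bitmap_oid_pos(bitmap_git, stack[nr]));
		/* Skip the commit index position and xor offset, as in the patch. */
		bitmap_git->map_pos = ofs + sizeof(uint32_t) + sizeof(uint8_t);
		flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
		bitmap = read_bitmap_1(bitmap_git);
		if (!bitmap) {
			free(stack);
			return NULL;
		}
		base = store_bitmap(bitmap_git, bitmap, &cur, base, flags);
	}

	free(stack);
	return base;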

> +
> +	/*
> +	 * Don't bother reading the commit's index position or its xor
> +	 * offset:
> +	 *
> +	 *   - The commit's index position is irrelevant to us, since
> +	 *     load_bitmap_entries_v1 only uses it to learn the object
> +	 *     id which is used to compute the hashmap's key. We already
> +	 *     have an object id, so no need to look it up again.
> +	 *
> +	 *   - The xor_offset is unusable for us, since it specifies how
> +	 *     many entries previous to ours we should look at. This
> +	 *     makes sense when reading the bitmaps sequentially (as in
> +	 *     load_bitmap_entries_v1()), since we can keep track of
> +	 *     each bitmap as we read them.
> +	 *
> +	 *     But it can't work for us, since the bitmap's don't have a
> +	 *     fixed size. So we learn the position of the xor'd bitmap
> +	 *     from the commit table (and resolve it to a bitmap in the
> +	 *     above if-statement).
> +	 *
> +	 * Instead, we can skip ahead and immediately read the flags and
> +	 * ewah bitmap.
> +	 */
> +	bitmap_git->map_pos = bitmap_ofs + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);

Looks like we'd want to call store_bitmap() while popping the stack
in the loop I recommended above.

> +}
> +
> +static struct stored_bitmap *stored_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +						      struct commit *commit,
> +						      uint32_t *pos_hint)
>  {
>  	khiter_t hash_pos = kh_get_oid_map(bitmap_git->bitmaps,
>  					   commit->object.oid);
> -	if (hash_pos >= kh_end(bitmap_git->bitmaps))
> +	if (hash_pos >= kh_end(bitmap_git->bitmaps)) {
> +		uint32_t commit_pos;
> +		if (!bitmap_git->table_lookup)
> +			return NULL;
> +
> +		/* NEEDSWORK: cache misses aren't recorded. */
> +		if (pos_hint)
> +			commit_pos = *pos_hint;
> +		else if (!bitmap_table_lookup(bitmap_git,
> +					      &commit->object.oid,
> +					      &commit_pos))
> +			return NULL;
> +		return lazy_bitmap_for_commit(bitmap_git, &commit->object.oid,
> +					      commit_pos);

The extra bonus of going iterative is that we don't have recursion
across two methods, which I always find difficult to reason about.

> +	}
> +	return kh_value(bitmap_git->bitmaps, hash_pos);
> +}

> @@ -26,6 +26,7 @@ struct bitmap_disk_header {
>  enum pack_bitmap_opts {
>  	BITMAP_OPT_FULL_DAG = 1,
>  	BITMAP_OPT_HASH_CACHE = 4,
> +	BITMAP_OPT_LOOKUP_TABLE = 16,

Perhaps it is time to use hexadecimal representation here to match the
file format document?
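
That is:

	enum pack_bitmap_opts {
		BITMAP_OPT_FULL_DAG = 0x1,
		BITMAP_OPT_HASH_CACHE = 0x4,
		BITMAP_OPT_LOOKUP_TABLE = 0x10,
	};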

Thanks,
-Stolee


