Re: [PATCH v2 4/6] pack-bitmap: prepare to read lookup table extension

On 6/26/2022 9:10 AM, Abhradeep Chakraborty via GitGitGadget wrote:
> From: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> 
> Earlier change teaches Git to write bitmap lookup table. But Git
> does not know how to parse them.
> 
> Teach Git to parse the existing bitmap lookup table. The older
> versions of git are not affected by it. Those versions ignore the
> lookup table.
> 
> Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx>
> Mentored-by: Taylor Blau <me@xxxxxxxxxxxx>
> Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@xxxxxxxxx>

I didn't check the previous patches, but your sign-off should be the
last line of the message. (You are signing off on all previous content,
and any later content is not covered by your sign-off.)

> +
> +		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
> +			git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {

nit: The continuation line should be indented with four extra spaces so
it lines up with the start of the condition on the previous line. Like this:

		if (flags & BITMAP_OPT_LOOKUP_TABLE &&
		    git_env_bool("GIT_TEST_READ_COMMIT_TABLE", 1)) {

Perhaps it looked right in your editor because it renders tabs as 4 spaces
instead of 8 spaces.

> +			size_t table_size = 0;
> +			size_t triplet_sz = st_add3(sizeof(uint32_t),    /* commit position */
> +							sizeof(uint64_t),    /* offset */
> +							sizeof(uint32_t));    /* xor offset */

The 4- vs 8-space tab view would also explain the alignment here:

			size_t triplet_sz = st_add3(sizeof(uint32_t),  /* commit position */
						    sizeof(uint64_t),  /* offset */
						    sizeof(uint32_t)); /* xor offset */

(I also modified the comment alignment.)

Of course, since these values are constants and have no risk of overflowing,
perhaps we can drop st_add3() here:

			size_t triplet_sz = sizeof(uint32_t) + /* commit position */
					    sizeof(uint64_t) + /* offset */
					    sizeof(uint32_t);  /* xor offset */

> +			table_size = st_add(table_size,
> +					st_mult(ntohl(header->entry_count),
> +						triplet_sz));

Here, we _do_ want to keep the st_mult(). Is the st_add() still necessary? It
seems this is a leftover from the previous version that had the 4-byte flag
data.

We set table_size to zero above. We could drop that initialization and instead
have this after the "size_t triplet_sz" definition:

			size_t table_size = st_mult(ntohl(header->entry_count),
						    triplet_sz);

> +			if (table_size > index_end - index->map - header_size)
> +				return error("corrupted bitmap index file (too short to fit lookup table)");

Please add "_(...)" around the error message so it can be translated.

> +			index->table_lookup = (void *)(index_end - table_size);
> +			index_end -= table_size;
> +		}

> -	/* a 0 return code means the insertion succeeded with no changes,
> -	 * because the SHA1 already existed on the map. this is bad, there
> -	 * shouldn't be duplicated commits in the index */
> +	/* A 0 return code means the insertion succeeded with no changes,
> +	 * because the SHA1 already existed on the map. If lookup table
> +	 * is NULL, this is bad, there shouldn't be duplicated commits
> +	 * in the index.
> +	 *
> +	 * If table_lookup exists, that means the desired bitmap is already
> +	 * loaded. Either this bitmap has been stored directly or another
> +	 * bitmap has a direct or indirect xor relation with it. */

If we are modifying this multi-line comment, then we should reformat it to
match convention:

	/*
	 * The first sentence starts after the comment start
	 * so it has symmetry with the comment end which is on
	 * its own line.
	 */

>  	if (ret == 0) {
> -		error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));
> -		return NULL;
> +		if (!index->table_lookup) {
> +			error("Duplicate entry in bitmap index: %s", oid_to_hex(oid));

Error messages start with a lowercase letter. Please also add translation
markers "_(...)".

> +static uint32_t triplet_get_xor_pos(const void *triplet)
> +{
> +	const void *p = (unsigned char*) triplet + st_add(sizeof(uint32_t), sizeof(uint64_t));

This st_add() is not necessary since the constants will not overflow.

> +	return get_be32(p);
> +}
> +
> +static int triplet_cmp(const void *va, const void *vb)
> +{
> +	int result = 0;
> +	uint32_t *a = (uint32_t *) va;
> +	uint32_t b = get_be32(vb);
> +	if (*a > b)
> +		result = 1;
> +	else if (*a < b)
> +		result = -1;
> +	else
> +		result = 0;
> +
> +	return result;
> +}
> +
> +static uint32_t bsearch_pos(struct bitmap_index *bitmap_git, struct object_id *oid,
> +						uint32_t *result)

Strange wrapping. Perhaps

static uint32_t bsearch_pos(struct bitmap_index *bitmap_git,
			    struct object_id *oid,
			    uint32_t *result)

> +{
> +	int found;
> +
> +	if (bitmap_git->midx)
> +		found = bsearch_midx(oid, bitmap_git->midx, result);
> +	else
> +		found = bsearch_pack(oid, bitmap_git->pack, result);
> +
> +	return found;

Here, we are doing a binary search on the entire list of packed objects, which could
use quite a few more hops than a binary search on the bitmapped commits.

> +static struct stored_bitmap *lazy_bitmap_for_commit(struct bitmap_index *bitmap_git,
> +					  struct commit *commit)
...
> +	int found = bsearch_pos(bitmap_git, oid, &commit_pos);
> +
> +	if (!found)
> +		return NULL;
> +
> +	triplet = bsearch(&commit_pos, bitmap_git->table_lookup, bitmap_git->entry_count,
> +						triplet_sz, triplet_cmp);

But I see, you are searching the pack-index for the position in the index, and _then_
searching the bitmap lookup table based on that position value.

I expected something different: binary search on the triplets where the comparison is
made by looking up the OID from the [multi-]pack-index and comparing that OID to the
commit OID we are looking for.

I'm not convinced that the binary search I had in mind is meaningfully faster than
what you've implemented here, so I'm happy to leave it as you have it. We can investigate
if that full search on the pack-index matters at all (it probably doesn't).

> +	if (!triplet)
> +		return NULL;
> +
> +	offset = triplet_get_offset(triplet);
> +	xor_pos = triplet_get_xor_pos(triplet);
> +
> +	if (xor_pos != 0xffffffff) {
> +		int xor_flags;
> +		uint64_t offset_xor;
> +		uint32_t *xor_positions;
> +		struct object_id xor_oid;
> +		size_t size = 0;
> +
> +		ALLOC_ARRAY(xor_positions, bitmap_git->entry_count);

While there is potential that this is wasteful, it's probably not that huge,
so we can start with the "maximum XOR depth" and then reconsider a smaller
allocation in the future.

> +		while (xor_pos != 0xffffffff) {

We should consider ensuring that also "size < bitmap_git->entry_count".
Better yet, create an xor_positions_alloc variable that is initialized
to the entry_count value.

"size" should probably be xor_positions_nr.

> +			xor_positions[size++] = xor_pos;
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> +			xor_pos = triplet_get_xor_pos(triplet);
> +		}

(at this point, "if (xor_positions_nr >= xor_positions_alloc)", then error
out since the file must be malformed with an XOR loop.)

> +		while (size){

nit: ") {"

> +			xor_pos = xor_positions[size - 1];
> +			triplet = bitmap_get_triplet(bitmap_git, xor_pos);
> +			commit_pos = get_be32(triplet);
> +			offset_xor = triplet_get_offset(triplet);
> +
> +			if (nth_bitmap_object_oid(bitmap_git, &xor_oid, commit_pos) < 0) {
> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			bitmap_git->map_pos = offset_xor + sizeof(uint32_t) + sizeof(uint8_t);
> +			xor_flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +			bitmap = read_bitmap_1(bitmap_git);
> +
> +			if (!bitmap){

nit: ") {"

> +				free(xor_positions);
> +				return NULL;
> +			}
> +
> +			xor_bitmap = store_bitmap(bitmap_git, bitmap, &xor_oid, xor_bitmap, xor_flags);

Since we are storing the bitmap here as we "pop" the stack, should we be
looking for a stored bitmap while pushing to the stack in the previous loop?
That would save time when using multiple bitmaps with common XOR bases.

(Of course, we want to be careful that we do not create a recursive loop,
but instead _only_ look at the in-memory bitmaps that already exist.)

> +			size--;
> +		}
> +
> +		free(xor_positions);
> +	}
> +
> +	bitmap_git->map_pos = offset + sizeof(uint32_t) + sizeof(uint8_t);
> +	flags = read_u8(bitmap_git->map, &bitmap_git->map_pos);
> +	bitmap = read_bitmap_1(bitmap_git);
> +
> +	if (!bitmap)
> +		return NULL;
> +
> +	return store_bitmap(bitmap_git, bitmap, oid, xor_bitmap, flags);
> +}
> +

I'm happy with the structure of this iterative algorithm!

I'll pause my review here for now.

Thanks,
-Stolee


