Re: [PATCH 3/3] packfile: use oidset for bad objects

René Scharfe <l.s.r@xxxxxx> · Sat, 11 Sep 2021 18:08:38 +0200

Am 11.09.21 um 16:26 schrieb Jeff King:
> On Sat, Sep 11, 2021 at 10:01:40AM +0200, René Scharfe wrote:
>
>> Store the object ID of broken pack entries in an oidset instead of
>> keeping only their hashes in an unsorted array.  The resulting code is
>> shorter and easier to read.  It also handles the (hopefully) very rare
>> case of having a high number of bad objects better.
>
> Yay, I'm very happy to see this kind of cleanup replacing ad hoc data
> structures with well-tested ones.
>
>> @@ -303,15 +304,9 @@ static int nth_midxed_pack_entry(struct repository *r,
>>  	if (!is_pack_valid(p))
>>  		return 0;
>>
>> -	if (p->num_bad_objects) {
>> -		uint32_t i;
>> -		struct object_id oid;
>> -		nth_midxed_object_oid(&oid, m, pos);
>> -		for (i = 0; i < p->num_bad_objects; i++)
>> -			if (hasheq(oid.hash,
>> -				   p->bad_object_sha1 + the_hash_algo->rawsz * i))
>> -				return 0;
>> -	}
>> +	nth_midxed_object_oid(&oid, m, pos);
>> +	if (oidset_contains(&p->bad_objects, &oid))
>> +		return 0;
>
> Calling nth_midxed_object_oid() implies a memcpy() under the hood. In
> the old code, we'd skip that in the common case that we had no corrupt
> objects, but now we'll pay the cost regardless. memcpy() isn't _that_
> expensive, but I'd expect this to be a relatively hot code path.
>
> Is it worth sticking all of this inside:
>
>   if (oidset_size(&p->bad_objects))
>
> ?

Hard to say.  It would certainly match the old code more closely.  Is a
function call cheaper than copying 32 bytes?  Depends on the CPU and
whether the hash is cached, I guess.  And cached it probably is, because
the caller did a binary search for it..

We can pass on the original oid to avoid the nth_midxed_object_oid()
call, but inlining the whole thing might even be nicer.

René