Re: parse_object does check_sha1_signature but not parse_object_buffer?

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 04 Feb 2016 09:40:54 -0800

Jeff King <peff@xxxxxxxx> writes:

> For local repository operations, most of them are about reading data.
> And there I generally favor performance over extra validation, with the
> caveat that we should always be aware of the tradeoff. An extra
> comparison to make sure we are not going out-of-bounds on a pack .idx
> pointer is cheap. Loading a blob just to make sure its sha1 is valid
> before we mention it in `diff --raw` output is stupid. Checking the sha1
> on objects we are otherwise accessing is somewhere in between. :)
>
> For local write operations, like repacking, we should err on the careful
> side. And I think we do a good job of balancing performance and
> validation there (e.g., we reuse deltas without reconstructing the
> object, but _with_ a crc check on the delta data itself).

OK, that is a reasonable set of rules.  We can say "checkout" is
writing out the contents to the filesystem, and make the "somewhere
in between" be closer to the "error on the careful side".

>> In fact, we do this, which is quite suboptimal:
>> 
>>         static int fsck_sha1(const unsigned char *sha1)
>>         {
>>                 struct object *obj = parse_object(sha1);
>>                 if (!obj) {
>>                         errors_found |= ERROR_OBJECT;
>>                         return error("%s: object corrupt or missing",
>>                                      sha1_to_hex(sha1));
>>                 }
>>                 obj->flags |= HAS_OBJ;
>>                 return fsck_obj(obj);
>>         }
>> 
>> This function is called for each loose object file we find in
>> fsck_object_dir(), and there are a few problems:
>> 
>>  * The function parse_object() called from here would issue an error
>>    message and returns NULL; then you get another "corrupt or
>>    missing" error message, because this code cannot tell from the
>>    NULL which is the case.
>> 
>>  * The intent of the callchain to fsck_sha1() is to iterate over
>>    loose object files xx/x{38} and validate what is contained in
>>    them, but that behaviour is not guaranteed because it calls
>>    parse_object(), which may get the object data from a packfile
>>    if the loose object is also in the packfile.
>> 
>> This function should instead take "path" (the only caller of this
>> function fsck_loose() has it), read the data in the file, hash the
>> data to validate that it matches "sha1" and then create the object
>> out of that data it read by calling parse_object_buffer().
>
> Yeah, I agree. I think as it is written, we also end up loading the
> loose objects twice (once in parse_object, and then later again in
> fsck_object to do the real work). Your solution would fix that, too.
> ...
> I guess nobody has really noticed either issue, because repositories
> large enough for it to make a difference will usually be packed.

I'll throw it in the leftover-bits list, then ;-)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html