Re: parse_object does check_sha1_signature but not parse_object_buffer?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Mike Hommey <mh@xxxxxxxxxxxx> writes:

> On Mon, Feb 01, 2016 at 07:10:04PM -0800, Junio C Hamano wrote:
>> Mike Hommey <mh@xxxxxxxxxxxx> writes:
>> 
>> > Shouldn't parse_object_buffer also do check_sha1_signature?
>> 
>> In general, it shouldn't; its callers are supposed to do it as
>> additional check when/if needed.  Callers like the one in fsck.c
>> does not want to die after seeing one bad one.  We want to report
>> and keep checking other things.
>
> Shouldn't some things like, at least, `git checkout`, still check
> the sha1s, though?

That is a different issue--my answer was about the quoted question
regarding parse_object_buffer().  Its callers are supposed to do
additional check when/if needed, and there may be codepaths that
currently use parse_object_buffer() that may want to do their own
check, or call a different function that does the check for them.

Having said that, I do not necessarily think "git checkout" should
revalidate the object name.  The repository that you use for your
daily work would have the same error/corruption rate as your working
tree files, and I do not think you would constantly "validate" what
is in your working tree by comparing their contents with what you
think ought to be there.

If you are working on extremely poor quality disks and SSDs, it
might make sense to constantly revalidating the object data to catch
corruption early, as that is what we can do (as opposed to the
working tree files, corruption to which you probably do not have
anything to catch bitflipping on).

Unless you are placing your working tree on a filesystem with
checksums, but your object data would also be protected against
corruption in that case, so an extra revalidation at "git checkout"
time would not buy you much if anything at all.

http://article.gmane.org/gmane.comp.version-control.git/283380 (not
necessarily the entire thread, but that exact article) is a
reasonable summary that illustrates the way how we view the object
integrity.

    So "index-pack" is the enforcement point, and the rest of the
    git commands generally assume that we can trust what is on disk
    (as it is has either been generated by us, or checked by
    index-pack).  The rest of the commands do not spend time
    checking that the on-disk contents are sane (though you can run
    git-fsck if you want to do that).

If anything, we may want to reduce the number of codepaths that
calls check_sha1_signature() on data that we know we have read from
our own repository.  Even though I am not opposed to an idea to have
a "paranoid" mode that revalidates the object name every time (and
if "git checkout" does not currently, allow it to revalidate when we
are operating under the "paranoid" mode), I do not think it should
be on by default.

In fact, I have this suspicion that the original justification to
have the call to check_sha1_signature() in parse_object() might have
been invalidated by the restructuring of the code over the past 10
years.  e9eefa67 ([PATCH] Add function to parse an object of
unspecified type (take 2), 2005-04-28) says

    It also checks the hash of the file, due to its heritage as part
    of fsck-cache.

I.e. we do not need this call here, as long as we make sure that
fsck codepath does not depend on the fact that parse_object() calls
check_sha1_signature() to validate the consistency between the data
and the object name.

In fact, we do this, which is quite suboptimal:

        static int fsck_sha1(const unsigned char *sha1)
        {
                struct object *obj = parse_object(sha1);
                if (!obj) {
                        errors_found |= ERROR_OBJECT;
                        return error("%s: object corrupt or missing",
                                     sha1_to_hex(sha1));
                }
                obj->flags |= HAS_OBJ;
                return fsck_obj(obj);
        }

This function is called for each loose object file we find in
fsck_object_dir(), and there are a few problems:

 * The function parse_object() called from here would issue an error
   message and returns NULL; then you get another "corrupt or
   missing" error message, because this code cannot tell from the
   NULL which is the case.

 * The intent of the callchain to fsck_sha1() is to iterate over
   loose object files xx/x{38} and validate what is contained in
   them, but that behaviour is not guaranteed because it calls
   parse_object(), which may get the object data from a packfile
   if the loose object is also in the packfile.

This function should instead take "path" (the only caller of this
function fsck_loose() has it), read the data in the file, hash the
data to validate that it matches "sha1" and then create the object
out of that data it read by calling parse_object_buffer().

I didn't check other callers of parse_object(), but I doubt that
they need or want a check_sha1_signature() call in the function.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]