Re: [RFC Patch] Preventing corrupt objects from entering the repository

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Wed, 13 Feb 2008 01:20:15 -0500

Nicolas Pitre <nico@xxxxxxx> wrote:
> On Tue, 12 Feb 2008, Martin Koegler wrote:
> 
> > Many validations are in parse_XXX_buffer, which are also used by
> > fsck. This returns a struct commit/tree/tag/blob.
> > 
> > I have not found any code in git to free them. 

Yea, there isn't a way to free the struct commit/tree/tag/blob
currently.  Once allocated its inside the hash, cannot be removed,
and cannot be recycled.

A while back I had worked up a patch that allowed us to reset the
entire hash and recycle the all of the objects, but that isn't
really what we need here.

> Maybe we should add it then.  Especially in the usual case where we want 
> incoming objects to be validated.

Releasing a single object back shouldn't be too difficult, so long as
you release everything after each parse.  Remember parsing a commit
loads the parent list with struct commit*s.  Those are in the object
hash and a later parse_commit_buffer() may attempt to reuse those
if the same commit was a parent for more than one descendent.  :-\

Basically I think you want a smaller case of what I had done earlier
in my attempt to fix an error in git-fetch; release everything in
the object pool after each object has been parsed and validated.
We should only have a couple of objects allocated in the blocks
(one blob, one tree, one tag, handful of commits) and just reset
the blocks and hash between each.

> > Same for pack-objects, e.g. add_objects_in_unpacked_packs allocates
> > many struct object via lookup_unknown_object. As far as I understand
> > the code, they are never freed, even if they are not needed later.
> 
> Hmmm, that's bad.

You can't free them.  We're using their o->flags field to store
OBJECT_ADDED state so we don't pack the same object twice in our
output, even if it appears in multiple input packfiles.

This isn't a leak.  Its behaving as designed.  It is however
overallocating the object, because we are allocating an unknown
object rather than a specific object type.  We don't know what the
heck that SHA-1 is just from looking at the index.  If it is an
OBJ_{REF,OFS}_DELTA its too expensive to acquire the exact type
during this method.

Which makes me wonder, would it make sense to store the object type
code inside of the .idx?

There's a few places in Git (4 lookup_unknown_objects, 19 possible
sha1_object_infos) where we may be able to benefit from having fast
access to the base type for an object, without needing to walk its
delta chain to acquire that information.  It certainly would help
in this spot of pack-objects that Martin has pointed out.

But I really don't think it is worth the disk storage space
necessary.  We only need 3 bits, but there isn't a place to steal
them from in the index v2 file format.  So we're adding a full byte
per object.  Further we only make this lookup_unknown_object call
in 3 locations:

  http-push.c/walker.c:

    We're fetching objects, but we don't know what we are getting.
    I think these are only triggered during the loose object
    upload/download, which should be only a fraction of the total
    objects transferred, and are only for cases where the object
    isn't local so we can't check its type before allocation.

    No big deal.

  tree.c:

    This is inside track_tree_refs and its a file mode we do
    not recognize.  We spit out a warning, but try to lookup
    the unknown object anyway, even though we're looking at what
    should be a bogus tree.  This occurs only in parse_tree_buffer,
    only if track_objects_refs is on, and only if we see a tree
    that is actually not understood by this version of Git.  Hmm,
    shouldn't happen unless the tree itself is corrupt.

    This code looked fishy enough to me to dig up the blame history
    for it:

      commit e2ac7cb5fbcf1407003aa07cdcd14141527ea2e3
      Author: Sam Vilain <sam.vilain@xxxxxxxxxxxxxxx>
      Date: Wed Jun  6 06:25:17 2007

      Don't assume tree entries that are not dirs are blobs

      When scanning the trees in track_tree_refs() there is a "lazy" test
      that assumes that entries are either directories or files.  Don't do
      that.

    Sounds like it was around the time of S_ISGITLINK being
    introduced.  Looking at this code again I have to wonder, why
    the hell are we looking up and tracking an object inside of a
    tree when we don't understand the mode?

    Lets say a new form of S_ISGITLINK gets introduced in the future.
    By this I mean the mode says "hey, this SHA-1 isn't really in my
    object pool".  We're going to cram a dummy object into the refs
    table for this tree.  Why?  We don't do this for S_ISGITLINK.

    Lets say its a new object type (not tree/tag/commit/blob) but
    it is in our storage pool.  If this Git doesn't know the mode,
    it sure as heck won't know what the hell the loose object header
    (or pack header!) says about that object.

    One of the key places where you might expect tracking an unknown
    (but referenced by a tree) SHA-1 type would be in reachability,
    rev-list --objects, packfile generation.  But the process_tree()
    function in reachable.c doesn't have Sam's change above, so it
    will assume anything new looking is a blob.

    Oh, and what a rabbit hole I just fell down.  The only caller
    that seems to set "track_object_refs = 1" (and thus get into this
    odd lookup_unknown_object() call) is fsck.  Everyone else sets
    it to 0, including its default declaration.

    So we've got this nice baggage, and differing implementation,
    so fsck can be happy how?  We've also got a whole lot of apps
    setting "track_object_refs = 0", which is what it defaults to,
    unless you managed to run fsck first.  Hmmph.

    Is it just me or is track_object_refs perhaps outlived its
    usefulness?

  pack-objects:

    Here we are packing everything in an existing packfile, as we
    are doing something like a "git repack -A" and need to carry
    it all with us.

    We probably could be smarter than allocating an unknown object,
    but I'm not sure its worth it.  If you really are paying a
    high price here it means you are carrying a *huge* amount of
    unreachable garbage with you from packfile to packfile.

    Try running "git gc --prune" every few months.

-- 
Shawn.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html