Re: [RFC Patch] Preventing corrupt objects from entering the repository

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> · Wed, 13 Feb 2008 02:42:09 -0500

Nicolas Pitre <nico@xxxxxxx> wrote:
> On Mon, 11 Feb 2008, Martin Koegler wrote:
> > 
> > But lots of code in git assums that the object content is welformd.
> > 
> > Having such objects somewhere reachable in your repository will
> > disturb cloning (In the best case you get a error messages, in the
> > worst a child process of upload-pack segfaults), gitweb, ... . To fix
> > it, you will need a person with native access to the repository in the
> > worst case. 

We have *waaaay* overthought this.  The solution is a lot simpler
than Martin's patches make it out to be, and we can do it without
changing our current memory footprint.

> I think this is a good idea to always have some sanity checks on any 
> incoming objects so to make sure they're well formed and valid before 
> giving them a SHA1 value, and bail out as soon as any error is found.  

When we get the raw data for an object so we can compute its SHA-1
and/or write its loose object to disk we should first verify its
content is sane, then do the SHA-1/store loose.

Blobs - always assume sane.  Remember that these make up the largest
percentage in terms of raw bytes of any project's packfiles and we
don't need to do any additional processing to them.  Yay us.

Tags - stack allocate a temporary struct tag and pass it's address
into parse_tag_buffer().  The only thing it tries to allocate is
the item->tagged.  Throw an extra argument into parse_tag_buffer()
to bypass this lookup block.  Now there's no extra memory used for
tags, just a bit more CPU.  Tags aren't frequent.

Commits - do like tags, but use parse_commit_buffer().  Avoid
looking up the tree object and the parents, and allocating the
parent references via a new argument.  Also check !item->date like
fsck does.  Other than that I think the validation code in fsck is
redundant with what parse_commit_buffer() is doing already.

Trees - just run init_tree_desc() and tree_entry() loop like in
track_tree_refs() (and a ton of other places) to iterate over the
buffer.  All we care about is sane entry names (no '/'), sane modes,
and correct sorting.

In short, we ignore reachability, but get fsck, and we can completely
avoid additional malloc activity as well as any sort of heap increase
in index-pack and unpack-objects.  Its not hard.

Its a small amount of refactoring for the parse functions we
already have, and maybe expose a function or two from builtin-fsck
(in particular a bulk of fsck_tree should be reused).

Yea, its a tad more CPU time, but it shouldn't be that costly here.
The huge cost in fsck is redoing the SHA-1 checksums, and inflating
the deltas.  We've already got the delta inflated, and we're about
to compute the SHA-1 checksum for the first time.  So those major
costs of fsck drop to 0.

> As to making sure those objects are well connected... well this is a 
> technically different issue entirely, and I wonder if a special mode to 
> fsck might not be a better solution.

Nah, just do what quickfetch does in builtin-fetch.c, but run it
in receive-pack, between unpack() and execute_commands():

	rev-list --quiet --objects $new... --not --all

If it aborts, reachability testing failed and the push is rejected
without updating any refs.  Yes your repository now has objects
that are missing things, but none of those are considered to be
reachable, so this isn't a big deal.  They will get cleaned up on
the next `gc --prune`, whenever that is.

In this configuration (--quiet) rev-list tries to be pretty low
on its memory usage, it doesn't save buffers, etc.  Further since
everything that is already considered reachable is not interesting,
we are only doing a walk over the objects that we just received,
not our entire ODB.  Its also after index-pack exited, so we just
freed up a good chunk of memory.

Rememeber we are talking about receive-pack here.  The cost on
the to perform the rev-list is lower than the cost will be to pack
these objects for distribution back to just one client.  Since this
is a server of some sorts (otherwise why did you push here?), odds
are its going to be doing a lot of packing requests for clients to
receive these newly uploaded objects by the native git protocol.
This new rev-list is nothing compared to that already existing load.
And if your OS is any good the just created .idx and .pack is still
in OS buffer cache, so there shouldn't be any additional disk IO.

Yes, we could make this optional in receive-pack, but really I don't
see a reason to.  Just run it.  The client shouldn't be giving us
unreachable crap.

-- 
Shawn.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html