Re: [RFC Patch] Preventing corrupt objects from entering the repository

mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) · Tue, 12 Feb 2008 22:38:58 +0100

On Tue, Feb 12, 2008 at 03:22:17PM -0500, Nicolas Pitre wrote:
> On Tue, 12 Feb 2008, Martin Koegler wrote:
> > We will need some additional memory for struct blob/tree/tag/commit
> > even for this check.
> 
> Why so?
> 
> Each received object is stored in memory when received, so why can't you 
> simply validate it in place?  No need to keep trace of them afterward.

Freeing the data is not my problem.

Many validations are in parse_XXX_buffer, which are also used by
fsck. This returns a struct commit/tree/tag/blob.

I have not found any code in git to free them. 

Same for pack-objects, e.g. add_objects_in_unpacked_packs allocates
many struct object via lookup_unknown_object. As far as I understand
the code, they are never freed, even if they are not needed later.

> > > As to making sure those objects are well connected... well this is a 
> > > technically different issue entirely, and I wonder if a special mode to 
> > > fsck might not be a better solution.  For example, fsck could be made to 
> > > validate object connectivity, starting from the new ref(s), and stopping 
> > > object walking as soon as a reference to an object not included in the 
> > > newly received pack is encountered.  This could be run from some hook to 
> > > decide whether or not to update the new refs, and to delete the pack 
> > > otherwise.
> > 
> > Do you really think, that this will need less memory? fsck loads first
> > all objects and then verifies their connections.
> 
> Not all objects otherwise I wouldn't even be able to run it.

It only loads all objects, it checks (eg. no objects from the pack file
by default). After loading a object, it frees the content of object,
but keeps the parsed information in memory for the connectivity check.

> My point is that you can have fsck load only objects contained in the 
> received pack (you can use the pack index to load them) and assume 
> connectivity is good whenever an object in the pack reference an 
> existing object outside of the pack.  At least this doesn't need to 
> happen in parallel with pack indexing.

So you propose the following:
1) run index-pack (write pack file + index in repository)
2) fsck pack file
3) update refs

This looks to be very prone to race conditions, if multiple
pushes run concurrently.

To be on the safe side, the checks must be finished, before a pack/object
becomes part of the repository. We must check the reachability before 
index-pack writes the index.

In my opinion, separating the code for the reachability check from
index-pack would complicate things. It would not safe memory, as the
memory would be used by two processes instead of one.

You don't want to increase the resource usage of index-pack. If
--strict is not passed, I implemented this. 

For --strict, it is not avoidable, that index-pack will use more
memory and cpu time.

mfg Martin Kögler

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html