Nicolas Pitre <nico@xxxxxxx> wrote: > On Mon, 11 Feb 2008, Martin Koegler wrote: > > > > But lots of code in git assums that the object content is welformd. > > > > Having such objects somewhere reachable in your repository will > > disturb cloning (In the best case you get a error messages, in the > > worst a child process of upload-pack segfaults), gitweb, ... . To fix > > it, you will need a person with native access to the repository in the > > worst case. We have *waaaay* overthought this. The solution is a lot simpler than Martin's patches make it out to be, and we can do it without changing our current memory footprint. > I think this is a good idea to always have some sanity checks on any > incoming objects so to make sure they're well formed and valid before > giving them a SHA1 value, and bail out as soon as any error is found. When we get the raw data for an object so we can compute its SHA-1 and/or write its loose object to disk we should first verify its content is sane, then do the SHA-1/store loose. Blobs - always assume sane. Remember that these make up the largest percentage in terms of raw bytes of any project's packfiles and we don't need to do any additional processing to them. Yay us. Tags - stack allocate a temporary struct tag and pass it's address into parse_tag_buffer(). The only thing it tries to allocate is the item->tagged. Throw an extra argument into parse_tag_buffer() to bypass this lookup block. Now there's no extra memory used for tags, just a bit more CPU. Tags aren't frequent. Commits - do like tags, but use parse_commit_buffer(). Avoid looking up the tree object and the parents, and allocating the parent references via a new argument. Also check !item->date like fsck does. Other than that I think the validation code in fsck is redundant with what parse_commit_buffer() is doing already. Trees - just run init_tree_desc() and tree_entry() loop like in track_tree_refs() (and a ton of other places) to iterate over the buffer. All we care about is sane entry names (no '/'), sane modes, and correct sorting. In short, we ignore reachability, but get fsck, and we can completely avoid additional malloc activity as well as any sort of heap increase in index-pack and unpack-objects. Its not hard. Its a small amount of refactoring for the parse functions we already have, and maybe expose a function or two from builtin-fsck (in particular a bulk of fsck_tree should be reused). Yea, its a tad more CPU time, but it shouldn't be that costly here. The huge cost in fsck is redoing the SHA-1 checksums, and inflating the deltas. We've already got the delta inflated, and we're about to compute the SHA-1 checksum for the first time. So those major costs of fsck drop to 0. > As to making sure those objects are well connected... well this is a > technically different issue entirely, and I wonder if a special mode to > fsck might not be a better solution. Nah, just do what quickfetch does in builtin-fetch.c, but run it in receive-pack, between unpack() and execute_commands(): rev-list --quiet --objects $new... --not --all If it aborts, reachability testing failed and the push is rejected without updating any refs. Yes your repository now has objects that are missing things, but none of those are considered to be reachable, so this isn't a big deal. They will get cleaned up on the next `gc --prune`, whenever that is. In this configuration (--quiet) rev-list tries to be pretty low on its memory usage, it doesn't save buffers, etc. Further since everything that is already considered reachable is not interesting, we are only doing a walk over the objects that we just received, not our entire ODB. Its also after index-pack exited, so we just freed up a good chunk of memory. Rememeber we are talking about receive-pack here. The cost on the to perform the rev-list is lower than the cost will be to pack these objects for distribution back to just one client. Since this is a server of some sorts (otherwise why did you push here?), odds are its going to be doing a lot of packing requests for clients to receive these newly uploaded objects by the native git protocol. This new rev-list is nothing compared to that already existing load. And if your OS is any good the just created .idx and .pack is still in OS buffer cache, so there shouldn't be any additional disk IO. Yes, we could make this optional in receive-pack, but really I don't see a reason to. Just run it. The client shouldn't be giving us unreachable crap. -- Shawn. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html