On Sat, Apr 08, 2017 at 04:53:38PM +0200, Ævar Arnfjörð Bjarmason wrote: > Very late reply, but I have a question about this. Is there anything > you can do on the plumbing level to figure out which area an object is > in (of course that's not mutually exclusive). > > The use-case for that is e.g. having a hook that rejects large > binaries, but has an exception for a binary that's been added in the > past, before your change there's no optimal way to find that out, > you'd need to go over the whole history and list all blobs that have > ever been added, with your change it would be trivial if something > could give me "this object is in the quarantine area", but AFAICT > there's no API that'll show me that. No, I don't think there's a way to ask that. The best API to add it to would probably be "cat-file --batch-check", which can tell you similar things about the on-disk storage, like "is it a delta? against what?" and the on-disk size. _But_. Like those other things, it would have the big caveat that the object may appear multiple times. So if somebody sent you an object that was a duplicate of an existing one, you may or may not see it as being in the quarantine directory. I think a better rule for hooks is to complain about history which introduces new references to problematic objects. IOW, instead of looking at the object database, actually see what the history introduces: while read old new refname; do echo "$new" if test "$old" != 0000000000000000000000000000000000000000; then echo "^$old" fi done | git rev-list --objects --stdin | while read sha1 fn; do # various object checks against $sha1 done However, if large binaries are your interest, I will suggest one other thing: it's much cheaper to find them during index-pack, rather than by traversing history. At GitHub we run with a patch to index-pack that dumps the sha1s of any large blobs into .large-objects in the quarantine directory. You can abort there, of course, but it's hard to produce a implement any other policy or produce a nice message. But if you just note them and leave it to the pre-receive hook, then that hook can do the traversal to find, e.g., which pathnames reference the large blobs. And it only has to run at all when we know there's at least one such blob (and it can quit immediately when it finds it). I can share those patches, but they need some cleanup to be ready for inclusion. > Also, I think this whole thing could really do with some documentation > in githooks(5). E.g. what hooks does it apply for? The test is just > for pre-receive but the patch changes run_update_hook(), does it also > take effect for update? Ditto the caveat about update-ref. My thinking was that the cases where the effects were user-visible were sufficiently insane that nobody would need to know or care when the feature was in use. It isn't in use for the update hook. It can't be, because that is run per-ref. So the objects need to be pulled out of the quarantine before the first ref is updated (I think the call in run_update_hook is leftover cruft from before I realized that; it should be a noop because we'll always have migrated the objects by then, and tmp_objdir_env() will just return NULL). -Peff