Re: [PATCH 4/6] receive-pack: quarantine objects until pre-receive accepts

Jeff King <peff@xxxxxxxx> · Mon, 10 Apr 2017 17:14:24 -0400

On Sat, Apr 08, 2017 at 04:53:38PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Very late reply, but I have a question about this. Is there anything
> you can do on the plumbing level to figure out which area an object is
> in (of course that's not mutually exclusive).
> 
> The use-case for that is e.g. having a hook that rejects large
> binaries, but has an exception for a binary that's been added in the
> past, before your change there's no optimal way to find that out,
> you'd need to go over the whole history and list all blobs that have
> ever been added, with your change it would be trivial if something
> could give me "this object is in the quarantine area", but AFAICT
> there's no API that'll show me that.

No, I don't think there's a way to ask that. The best API to add it to
would probably be "cat-file --batch-check", which can tell you similar
things about the on-disk storage, like "is it a delta? against what?"
and the on-disk size.

_But_. Like those other things, it would have the big caveat that the
object may appear multiple times. So if somebody sent you an object that
was a duplicate of an existing one, you may or may not see it as being
in the quarantine directory.

I think a better rule for hooks is to complain about history which
introduces new references to problematic objects. IOW, instead of
looking at the object database, actually see what the history
introduces:

  while read old new refname; do
	echo "$new"
	if test "$old" != 0000000000000000000000000000000000000000; then
		echo "^$old"
	fi
  done |
  git rev-list --objects --stdin |
  while read sha1 fn; do
    # various object checks against $sha1
  done

However, if large binaries are your interest, I will suggest one other
thing: it's much cheaper to find them during index-pack, rather than by
traversing history. At GitHub we run with a patch to index-pack that
dumps the sha1s of any large blobs into .large-objects in the quarantine
directory. You can abort there, of course, but it's hard to produce a
implement any other policy or produce a nice message. But if you just
note them and leave it to the pre-receive hook, then that hook can do
the traversal to find, e.g., which pathnames reference the large blobs.
And it only has to run at all when we know there's at least one such
blob (and it can quit immediately when it finds it).

I can share those patches, but they need some cleanup to be ready for
inclusion.

> Also, I think this whole thing could really do with some documentation
> in githooks(5). E.g. what hooks does it apply for? The test is just
> for pre-receive but the patch changes run_update_hook(), does it also
> take effect for update? Ditto the caveat about update-ref.

My thinking was that the cases where the effects were user-visible were
sufficiently insane that nobody would need to know or care when the
feature was in use.

It isn't in use for the update hook. It can't be, because that is run
per-ref. So the objects need to be pulled out of the quarantine before
the first ref is updated (I think the call in run_update_hook is
leftover cruft from before I realized that; it should be a noop because
we'll always have migrated the objects by then, and tmp_objdir_env()
will just return NULL).

-Peff