Re: [PATCH v2] receive-pack: not receive pack file with large object

On Thu, Sep 30, 2021 at 03:42:29PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > From: Han Xin <hanxin.hx@xxxxxxxxxxxxxxx>
> >
> > In addition to using 'receive.maxInputSize' to limit the overall size
> > of the received packfile, a new config variable
> > 'receive.maxInputObjectSize' is added to limit the push of a single
> > object larger than this threshold.
> 
> Maybe an unfair knee-jerk reaction: I think we should really be pushing
> this sort of thing into pre-receive hooks and/or the proc-receive hook,
> i.e. see 15d3af5e22e (receive-pack: add new proc-receive hook,
> 2020-08-27).

Yes, this could be done as a hook, but it's actually kind of tricky.
We have a similar max-object-size limit in our fork at GitHub, and it's
implemented inside index-pack (we don't have a matching check in
unpack-objects, but we always run index-pack on incoming packs).

Unlike the patches here, ours is limiting the logical size of an object
(so a 100MB blob limit is on the uncompressed size). It happens in
sha1_object(), where we have that size.
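
To make that concrete, here's a minimal sketch of the shape of that
check (written against git's internal APIs; the names
"check_object_size" and "max_input_object_size" are made up for
illustration, not what our fork actually calls them):

  /*
   * Illustrative only: reject an object once index-pack knows its
   * inflated size. A limit of 0 means "no limit".
   */
  static unsigned long max_input_object_size;

  static void check_object_size(const struct object_id *oid,
                                enum object_type type,
                                unsigned long size)
  {
      if (max_input_object_size && size > max_input_object_size)
          die("object %s exceeds maximum allowed size (%lu > %lu)",
              oid_to_hex(oid), size, max_input_object_size);
  }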

The main reason we put it there is for performance. index-pack is
already computing the size, so it's free to check it there. But it does
make some things awkward. Rather than just dying with "woah, some object
is too big", it's nice to tell the user "hey, the object 1234abcd at
foo.bin is too big". To do that, we actually collect the list of too-big
objects and index them anyway (remember that we're in the quarantine
directory, so nothing is permanent until later). We write the list to a
".large-objects" file in the quarantine directory. And then hooks are
free to produce a much nicer message (e.g., by running something like
"log --find-object=<oid>" to get the path name at which we see the blob).

It would be nice if the hook could just find the objects itself, but
there are some gotchas:

  - finding the list of pushed objects is awkward and/or expensive. You
    can use rev-list, but that's very costly, as it has to inflate all
    of the new trees. You really just want the list of what's in the
    quarantine directory, but there's no single command to get that. You
    have to run show-index on the incoming pack, plus poke through the
    loose object directories.

    It would be much easier if "cat-file --batch-all-objects" could be
    asked to only look at local objects. That would also allow some
    other optimizations to reduce the cost. For instance, even when
    iterating packs, cat-file then feeds each sha1 back to
    oid_object_info(), even though we already know the exact pack/offset
    where it can be found. Likewise, it could be taught some basic
    filtering (like "show only objects with size > X") to avoid dumping
    millions of oid/size pairs to a separate script to do the filtering.

    But all of that would just be approaching the speed of having
    index-pack do the check, since it has the size already there. (A
    sketch of the local-only iteration follows this list.)

  - it wouldn't help index-pack at all with memory attacks by malicious
    pushers. Here somebody accidentally pushed up a big blob, and it
    caused unpack-objects to OOM. But I also remember a problem ages ago
    where some of the fsck_tree() code had an integer overflow
    vulnerability, which could only be triggered by a gigantic tree.
    In the patch we use at GitHub, we allow large blobs to make it to
    the hook level, but too-large trees, commits, and tags are
    unceremoniously aborted with an immediate die(). Real users
    accidentally try to push 2GB objects. Trees of that size are no
    accident. :)

    Having index-pack enforce size limits helps a bit there. And
    streaming large blobs helps protect against accidents. But there are
    still OOM problems with maliciously constructed inputs, because the
    delta code only works on full buffers (so it really wants to
    construct the whole object before we get to the sha1_object()
    limits). It would be nice if this could stream the output, but I
    suspect it would require pretty major surgery to the delta code.

    Instead, we've put in a crude limit by having receive-pack set
    GIT_ALLOC_LIMIT in the environment, which then ensures that its
    child index-pack won't allocate too much (the limit we use is much
    higher than the more user-facing object limit, because the point is
    just to avoid DoS attacks, and not enforce any kind of policy).

    It might be nice to have a config option for setting this limit, and
    perhaps even to give it a reasonable defensive default. A sketch of
    the kind of guard GIT_ALLOC_LIMIT enables also follows this list.
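
On the first gotcha above, here's roughly what the "local objects
only" iteration could look like, built on git's internal
for_each_loose_object() / for_each_packed_object() API. The flag and
callback shapes follow object-store.h, but treat this as an
approximation rather than a finished patch:

  /*
   * Sketch: walk only objects in the local object directory,
   * skipping alternates. During a push, the quarantine directory
   * is the local object directory, which is exactly the set of
   * objects a hook would want to see.
   */
  static int show_loose(const struct object_id *oid, const char *path,
                        void *data)
  {
      unsigned long size;

      if (oid_object_info(the_repository, oid, &size) < 0)
          return error("unable to get size of %s", oid_to_hex(oid));
      printf("%s %lu\n", oid_to_hex(oid), size);
      return 0;
  }

  static int show_packed(const struct object_id *oid,
                         struct packed_git *pack, uint32_t pos,
                         void *data)
  {
      /*
       * We already know the pack and offset here; a real patch
       * could read the size directly instead of doing the extra
       * lookup that show_loose() does.
       */
      return show_loose(oid, NULL, data);
  }

  static void list_local_objects(void)
  {
      for_each_loose_object(show_loose, NULL,
                            FOR_EACH_OBJECT_LOCAL_ONLY);
      for_each_packed_object(show_packed, NULL,
                             FOR_EACH_OBJECT_LOCAL_ONLY);
  }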
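And on the GIT_ALLOC_LIMIT point, the guard itself is tiny. Here's a
standalone sketch of the idea (git's real check lives in its
allocation wrappers, and its parser accepts unit suffixes like "512m";
this simplified version treats the value as plain bytes):

  #include <stdio.h>
  #include <stdlib.h>

  /* Return the allocation limit from the environment; 0 = no limit. */
  static size_t alloc_limit(void)
  {
      static size_t limit;
      static int initialized;

      if (!initialized) {
          const char *env = getenv("GIT_ALLOC_LIMIT");
          limit = env ? (size_t)strtoull(env, NULL, 10) : 0;
          initialized = 1;
      }
      return limit;
  }

  /* malloc() wrapper that dies on a single over-limit allocation. */
  static void *xmalloc_guarded(size_t size)
  {
      size_t limit = alloc_limit();
      void *ret;

      if (limit && size > limit) {
          fprintf(stderr, "fatal: attempting to allocate %zu bytes "
                  "over limit %zu\n", size, limit);
          exit(128);
      }
      ret = malloc(size);
      if (!ret) {
          fprintf(stderr, "fatal: out of memory, malloc failed\n");
          exit(128);
      }
      return ret;
  }

Then receive-pack only has to put the variable into the environment of
its child index-pack or unpack-objects process; no cooperation from
the pusher is needed.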

So I do think there are some interesting paths forward for doing more of
this in hooks, but there's work to be done to get there.

> The latter pre-dates c08db5a2d0d (receive-pack: allow a maximum input
> size to be specified, 2016-08-24), which may or may not be relevant
> here.

Yeah, I introduced that limit. It wasn't really about object size or
memory, but just about the fact that without it, a malicious pusher can
just send you bytes forever, which the server will write to disk. It also
helps with the most naive kind of large-object attacks, but it wouldn't
help with a cleverly constructed delta.

It is, unfortunately, a pretty unceremonious die(), and has caused more
than one support ticket (especially because GitHub has used 2GB for the
limit for ages, while the practical size of repositories has kept
growing).

So there. That's probably more than anybody wanted to know about push
limits. ;) I'm happy to share any of the code from GitHub's fork in
these areas. The main reason I haven't is that some of it is just
kind of ugly and may need polishing (e.g., the whole "write to
.large-objects" scheme is really an awful interface).

-Peff


