On Fri, Jun 12, 2020 at 06:03:03PM -0700, Stanislav Fomichev wrote: > On Fri, Jun 12, 2020 at 5:34 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Mon, Jun 08, 2020 at 11:27:47AM -0700, Stanislav Fomichev wrote: > > > Attaching to these hooks can break iptables because its optval is > > > usually quite big, or at least bigger than the current PAGE_SIZE limit. > > > David also mentioned some SCTP options can be big (around 256k). > > > > > > There are two possible ways to fix it: > > > 1. Increase the limit to match iptables max optval. There is, however, > > > no clear upper limit. Technically, iptables can accept up to > > > 512M of data (not sure how practical it is though). > > > > > > 2. Bypass the value (don't expose to BPF) if it's too big and trigger > > > BPF only with level/optname so BPF can still decide whether > > > to allow/deny big sockopts. > > > > > > The initial attempt was implemented using strategy #1. Due to > > > listed shortcomings, let's switch to strategy #2. When there is > > > legitimate a real use-case for iptables/SCTP, we can consider increasing > > > the PAGE_SIZE limit. > > > > > > v3: > > > * don't increase the limit, bypass the argument > > > > > > v2: > > > * proper comments formatting (Jakub Kicinski) > > > > > > Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > > > Cc: David Laight <David.Laight@xxxxxxxxxx> > > > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx> > > > --- > > > kernel/bpf/cgroup.c | 18 ++++++++++++++---- > > > 1 file changed, 14 insertions(+), 4 deletions(-) > > > > > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > > > index fdf7836750a3..758082853086 100644 > > > --- a/kernel/bpf/cgroup.c > > > +++ b/kernel/bpf/cgroup.c > > > @@ -1276,9 +1276,18 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, > > > > > > static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) > > > { > > > - if (unlikely(max_optlen > PAGE_SIZE) || max_optlen < 0) > > > + if (unlikely(max_optlen < 0)) > > > return -EINVAL; > > > > > > + if (unlikely(max_optlen > PAGE_SIZE)) { > > > + /* We don't expose optvals that are greater than PAGE_SIZE > > > + * to the BPF program. > > > + */ > > > + ctx->optval = NULL; > > > + ctx->optval_end = NULL; > > > + return 0; > > > + } > > > > It's probably ok, but makes me uneasy about verifier consequences. > > ctx->optval is PTR_TO_PACKET and it's a valid pointer from verifier pov. > > Do we have cases already where PTR_TO_PACKET == PTR_TO_PACKET_END ? > > I don't think we have such tests. I guess bpf prog won't be able to read > > anything and nothing will crash, but having PTR_TO_PACKET that is > > actually NULL would be an odd special case to keep in mind for everyone > > who will work on the verifier from now on. > > > > Also consider bpf prog that simply reads something small like 4 bytes. > > IP_FREEBIND sockopt (like your selftest in the patch 2) will have > > those 4 bytes, so it's natural for the prog to assume that it can read it. > > It will have > > p = ctx->optval; > > if (p + 4 > ctx->optval_end) > > /* goto out and don't bother logging, since that never happens */ > > *(u32*)p; > > > > but 'clever' user space would pass long optlen and prog suddenly > > 'not seeing' the sockopt. It didn't crash, but debugging would be > > surprising. > > > > I feel it's better to copy the first 4k and let the program see it. > Agreed with the IP_FREEBIND example wrt observability, however it's > not clear what to do with the cropped buffer if the bpf program > modifies it. > > Consider that huge iptables setsockopts where the usespace passes > PAGE_SIZE*10 optlen with real data and bpf prog sees only part of it. > Now, if the bpf program modifies the buffer (say, flips some byte), we > are back to square one. We either have to silently discard that buffer > or reallocate/merge. My reasoning with data == NULL, is that at least > there is a clear signal that the program can't access the data (and > can look at optlen to see if the original buffer is indeed non-zero > and maybe deny such requests?). > At this point I'm really starting to think that maybe we should just > vmalloc everything that is >PAGE_SIZE and add a sysclt to limit an > upper bound :-/ > I'll try to think about this a bit more over the weekend. Yeah. Tough choices. We can also detect in the verifier whether program accessed ctx->optval and skip alloc/copy if program didn't touch it, but I suspect in most case the program would want to read it. I think vmallocing what optlen said is DoS-able. It's better to stick with single page. Let's keep brainstorming.