On Thu, Oct 27, 2022 at 1:04 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > On 10/27/22 12:11 PM, Stanislav Fomichev wrote: > > On Thu, Oct 27, 2022 at 11:48 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >> > >> On 10/27/22 10:40 AM, Stanislav Fomichev wrote: > >>> On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >>>> > >>>> On 10/27/22 9:23 AM, Stanislav Fomichev wrote: > >>>>> On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >>>>>> > >>>>>> On 10/26/22 7:03 PM, Stanislav Fomichev wrote: > >>>>>>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior. > >>>>>>>> The bpf prog usually just handles a few specific optnames and ignores most > >>>>>>>> others. For the optnames that it ignores, it usually does not need to change > >>>>>>>> the optlen. The exception is when optlen > PAGE_SIZE (or optval_end - optval). > >>>>>>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will > >>>>>>>> return -EFAULT to the userspace. It is usually not what the bpf prog wants > >>>>>>>> because the bpf prog only expects error returning to userspace when it has > >>>>>>>> explicitly 'return 0;' or used bpf_set_retval(). If a bpf prog always changes > >>>>>>>> optlen for optnames that it does not care to 0, it may risk if the latter bpf > >>>>>>>> prog in the same cgroup may want to change/look-at it. > >>>>>>>> > >>>>>>>> Would like to explore if there is an easier way for the bpf prog to handle it. > >>>>>>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen > >>>>>>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen? > >>>>>>> > >>>>>>> Good point on chaining being broken because of this requirement :-/ > >>>>>>> > >>>>>>> With tracking, we need to be careful, because the following situation > >>>>>>> might be problematic: > >>>>>>> Suppose setsockopt is larger than 4k, the program can rewrite some > >>>>>>> byte in the first 4k, not touch optlen and expect this to work. > >>>>>> > >>>>>> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it > >>>>>> work. Otherwise, the kernel will return -EFAULT because the ctx.optlen is > >>>>>> larger than the max_optlen (or optval_end - optval). > >>>>>> > >>>>>>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf > >>>>>>> buffer and use the original one" > If we can have a tracking that catches situations like this - we > >>>>>>> should be able to drop that optlen=0 requirement. > >>>>>>> IIRC, that's the only tricky part. > >>>>>> > >>>>>> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed && > >>>>>> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf > >>>>>> buffer and use the original one". Add 'bool optlen_changed' to 'struct > >>>>>> bpf_sockopt_kern' and set ctx.optlen_changed to true in > >>>>>> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen. > >>>>>> Would it work or may be I am still missing something in the writing first 4k > >>>>>> case above? > >>>>> > >>>>> What if the program wants to keep optlen as is? Here is the > >>>>> hypothetical case: ctx->optlen is 8k, we allocate/expose only the > >>>>> first 4k, the program does ctx->optval[0] = 0xff and doesn't change > >>>>> the optlen. It wants the rest of the payload to be passed as is with > >>>>> only the first byte changed. > >>>> > >>>> I think we are talking about the same case but we may have different > >>>> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is > >>>> handling it. > >>>> > >>>> I don't see the current kernel supports this now. If the bpf prog does not > >>>> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just > >>>> return -EFAULT in here, no? > >>>> else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) { > >>>> /* optlen is out of bounds */ > >>>> ret = -EFAULT; > >>>> } > >>>> > >>>> or you meant the future change needs to consider this new case and also support > >>>> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that > >>>> was not exposed to the bpf prog)? > >>>> > >>>>> The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is > >>>>> true, so, if we treat this as explicit optlen=0, we ignore the > >>>>> program's changes. > >>>>> But this is not what the program has intended, right? It wants to > >>>>> amend something and pass the rest as is. > >>> > >>> Right, I'm not talking about how it's handled now. Now optlen > > >>> max_optlen triggers EFAULT. > >>> But in the future, if we add tracking, we want 'optlen > max_optlen' > >>> to behave as explicit 'optlen = 0' as long as the user hasn't changed > >>> the optlen _and_ also hasn't changed anything in the buffer. > >> > >> Ah, ic. > >> > >> Tracking the runtime buffer change will be hard as of the current state through > >> the ctx->optval. I don't think we need to track that either. If the existing > >> bpf prog wants the changed buf to be used, it must have changed the optlen > >> already. Thus, tracking optlen only should be as good. > > > > I might be still missing something on why tracking optlen is enough? > > > > Consider this BPF program: > > > > SEC("cgroup/setsockopt") > > int _setsockopt(struct bpf_sockopt *ctx) > > { > > __u8 *optval_end = ctx->optval_end; > > __u8 *optval = ctx->optval; > > > > if (optval + 1 > optval_end) return 0; > > > > optval[0] = 0xff; > > return 1; > > } > > > > And the userspace doing the following: > > > > __u32 buf[4096*2] = {}; > > ret = setsockopt(fd, SOME_LEVEL, SOME_OPTLEN, &buf, sizeof(buf)); > > > > Right now, without explicit 'optlen = 0' in the BPF program, we'll get > > -1/EFAULT here (unarguably, this is a bad interface, but still better > > than ignoring program's buf?). > > If we track only optlen in the program, we'd get success, but the > > changed buffer will be ignored by the kernel. (what am I missing > > here?) > > Right, this will break if the bpf prog depends on this -EFAULT behavior in > anyway. Similar to my example below, tracking the buffer change still won't be > enough because we don't know the intention of the bpf prog (changed but forgot > to update optlen or it does want to return -EFAULT). > > After these few examples in the thread, I think this optlen and buffer tracking > does not seem to be a tangible path to solve it. It seems like it is only > papering around it. > > > > >> If the bpf prog is depending on the kernel to do implicit -EFAULT like this, > >> yes, it will break even the buffer change is tracked. > >> > >> if (ctx->optlen > ctx->optval_end - ctx->optval) > >> return 1; /* 0 will be -EPERM, so 1 here to make kernel return -EFAULT for > >> us */ > > > > [..] > > > >> I would argue that it is more like a surprise than a feature if the bpf prog > >> depends on ctx.optlen > max_optlen (only for the > 4k case though) to do an > >> implicit reject (through EFAULT) instead of directly using the 'return 0' or > >> bpf_set_retval() which is exactly how it should be done to reject other "normal" > >> integer optval. > > > > That all comes from the issue above. We want to have a contract with > > the bpf program: when optlen>4k, it has to do something with the > > optlen (set it to 0 to ignore, set it to <4096 to pass to the kernel). > > It can't just change something in the 4k of the exposed buffer and > > assume this data will be passed to the kernel. > > > >> I am also not sure how useful it is to expose partial data to the bpf prog and > >> have a way for the bpf prog to tell the kernel to join the remaining. Instead, > >> it would be more useful to have API for the bpf prog to have access to the whole > >> data instead. > > > > That seems like a better way to go? We didn't do that initially > > because the data is in the __user memory and we can't pass it to bpf; > > we had to do this extra copy/allocation :-( I think we decided against > > copying everything because this can be abused due to no sane limit on > > the setsockopt value size. Nothing prevents userspace from passing a > > huge buffer when doing, say, SO_MARK; the kernel will read the first > > int and be happy with it. > > yeah, may be one thing for the future API is to avoid the pre allocation. There > is bpf_copy_from_user but it needs to be sleepable. +1, having a sleepable version might be cleaner alternative > > > >>>>> It seems like we need to have both optlen_changed and optval_changed. > >>>>> If both are false, we should be able to safely do optlen=0 equivalent. > >>>>> Tracking only optlen seems to be problematic? > >>>> > >> >