Re: [Question]: BPF_CGROUP_{GET,SET}SOCKOPT handling when optlen > PAGE_SIZE

Stanislav Fomichev <sdf@xxxxxxxxxx> · Thu, 27 Oct 2022 10:40:49 -0700

On Thu, Oct 27, 2022 at 10:29 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 10/27/22 9:23 AM, Stanislav Fomichev wrote:
> > On Wed, Oct 26, 2022 at 11:15 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>
> >> On 10/26/22 7:03 PM, Stanislav Fomichev wrote:
> >>> On Wed, Oct 26, 2022 at 6:14 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>>>
> >>>> The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
> >>>> The bpf prog usually just handles a few specific optnames and ignores most
> >>>> others.  For the optnames that it ignores, it usually does not need to change
> >>>> the optlen.  The exception is when optlen > PAGE_SIZE (or optval_end - optval).
> >>>> The bpf prog needs to set the optlen to 0 for this case or else the kernel will
> >>>> return -EFAULT to the userspace.  It is usually not what the bpf prog wants
> >>>> because the bpf prog only expects error returning to userspace when it has
> >>>> explicitly 'return 0;' or used bpf_set_retval().  If a bpf prog always changes
> >>>> optlen for optnames that it does not care to 0,  it may risk if the latter bpf
> >>>> prog in the same cgroup may want to change/look-at it.
> >>>>
> >>>> Would like to explore if there is an easier way for the bpf prog to handle it.
> >>>> eg. does it make sense to track if the bpf prog has changed the ctx->optlen
> >>>> before returning -EFAULT to the user space when ctx.optlen > max_optlen?
> >>>
> >>> Good point on chaining being broken because of this requirement :-/
> >>>
> >>> With tracking, we need to be careful, because the following situation
> >>> might be problematic:
> >>> Suppose setsockopt is larger than 4k, the program can rewrite some
> >>> byte in the first 4k, not touch optlen and expect this to work.
> >>
> >> If the bpf prog rewrites the first 4k, it must change the ctx.optlen to get it
> >> work.  Otherwise, the kernel will return -EFAULT because the ctx.optlen is
> >> larger than the max_optlen (or optval_end - optval).
> >>
> >>> Currently, optlen=0 explicitly means "ignore whatever is in the bpf
> >>> buffer and use the original one" > If we can have a tracking that catches situations like this - we
> >>> should be able to drop that optlen=0 requirement.
> >>> IIRC, that's the only tricky part.
> >>
> >> Ah, I meant, in __cgroup_bpf_run_filter_setsockopt, use "!ctx.optlen_changed &&
> >> ctx.optlen > max_optlen" test to imply "ignore whatever is in the bpf
> >> buffer and use the original one".  Add 'bool optlen_changed' to 'struct
> >> bpf_sockopt_kern' and set ctx.optlen_changed to true in
> >> cg_sockopt_convert_ctx_access() whenever there is BPF_WRITE to ctx.optlen.
> >> Would it work or may be I am still missing something in the writing first 4k
> >> case above?
> >
> > What if the program wants to keep optlen as is? Here is the
> > hypothetical case: ctx->optlen is 8k, we allocate/expose only the
> > first 4k, the program does ctx->optval[0] = 0xff and doesn't change
> > the optlen. It wants the rest of the payload to be passed as is with
> > only the first byte changed.
>
> I think we are talking about the same case but we may have different
> understanding on how the current __cgroup_bpf_run_filter_setsockopt() is
> handling it.
>
> I don't see the current kernel supports this now.  If the bpf prog does not
> change the ctx->optlen from 8k to something that is <= 4k, the kernel will just
> return -EFAULT in here, no?
>         else if (ctx.optlen /* 8k */ > max_optlen /* 4k */ || ctx.optlen < -1) {
>                 /* optlen is out of bounds */
>                  ret = -EFAULT;
>          }
>
> or you meant the future change needs to consider this new case and also support
> gluing the first 4k (that was exposed to the bpf prog) with the second 4k (that
> was not exposed to the bpf prog)?
>
> > The condition "!ctx.optlen_changed && ctx.optlen > max_optlen" is
> > true, so, if we treat this as explicit optlen=0, we ignore the
> > program's changes.
> > But this is not what the program has intended, right? It wants to
> > amend something and pass the rest as is.

Right, I'm not talking about how it's handled now. Now optlen >
max_optlen triggers EFAULT.
But in the future, if we add tracking, we want 'optlen > max_optlen'
to behave as explicit 'optlen = 0' as long as the user hasn't changed
the optlen _and_ also hasn't changed anything in the buffer.

> > It seems like we need to have both optlen_changed and optval_changed.
> > If both are false, we should be able to safely do optlen=0 equivalent.
> > Tracking only optlen seems to be problematic?
>