Re: [RFC bpf-next v3 4/5] bpf: Add a new dynptr type for CGRUP_SOCKOPT.

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Thu, 17 Aug 2023 15:56:40 -0700

On 8/17/23 2:37 PM, Yonghong Song wrote:

On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
But I think we have to step back. Why do we need this whole thing in the 
first place?
_why_  sockopt bpf progs needs to read and write user memory?

Yes there is one page limit, but what is the use case to actually read and write
beyond that? iptables sockopt was mentioned, but I don't think bpf prog can do
anything useful with iptables binary blobs. They are hard enough for kernel 
to parse.

Usually the bpf prog is only interested in a very small number of optnames and 
no need to read the optval at all for most cases. The max size for our use 
cases is 16 bytes. The kernel currently is kind of doing it the opposite and 
always assumes the bpf prog needing to use the optval, allocate kernel memory 
and copy_from_user such that the non-sleepable bpf program can read/write it.

The bpf prog usually checks the optname and then just returns for most cases:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
     if (ctx->optname != MY_INTERNAL_SOCKOPT)
         return 1;

     /* change the ctx->optval and return to user space ... */
}

When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE memory and 
copy the first PAGE_SIZE data from the user optval. We used to ask the bpf 
prog to explicitly set the optlen to 0 for > PAGE_SIZE case even it has not 
looked at the optval. Otherwise, the kernel used to conclude that the bpf prog 
had set an invalid optlen because optlen is larger than the optval_end - 
optval and returned -EFAULT incorrectly to the end-user.

The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due to an 
internal kernel PAGE_SIZE limitation:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
     if (ctx->optname != MY_INTERNAL_SOCKOPT) {
         /* only do that for ctx->optlen > PAGE_SIZE.
          * otherwise, the following cgroup bpf prog will
          * not be able to use the optval that it may
          * be interested.
          */
         if (ctx->optlen > PAGE_SIZE)
             ctx->optlen = 0;
         return 1;
     }

     /* change the ctx->optval and return to user space ... */
}

The above has been worked around in commit 29ebbba7d461 ("bpf: Don't EFAULT 
for {g,s}setsockopt with wrong optlen").

Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that the 
kernel allows a user passing NULL optval and using the optlen returned by 
getsockopt to learn the buffer space required. The bpf prog then needs to 
remove the optlen > PAGE_SIZE check and set optlen to 0 for _all_ optnames 
that it is not interested while risking the following cgroup prog may not be 
able to use some of the optval:

SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
     if (ctx->optname != MY_INTERNAL_SOCKOPT) {

         /* Do that for all optname that you are not interested.
          * The latter cgroup bpf will not be able to use the optval.
          */
          ctx->optlen = 0;
         return 1;
     }

     /* chage the ctx->optval and return to user space ... */
}

The above case has been addressed in commit 00e74ae08638 ("bpf: Don't EFAULT 
for getsockopt with optval=NULL").

To avoid other potential optname cases that may run into similar situation and 
requires the bpf prog work around it again like the above, it needs a way to 
track whether a bpf prog has changed the optval in runtime. If it is not 
changed, use the result from the kernel 

Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
if optval is changed?

This new interface should work. If there is an old-existing prog staying with 
the old interface (didn't set this bool but changed the optval) in the cgroup 
prog array, it probably end up not improving anything also?

or the verifier can enforce setting this bool in runtime when writing to optval? 
do not know how demanding the verifier change is. I am not sure if this would be 
an overkill for the verifier.

struct bpf_sockopt {
         __bpf_md_ptr(struct bpf_sock *, sk);
         __bpf_md_ptr(void *, optval);
         __bpf_md_ptr(void *, optval_end);

         __s32   level;
         __s32   optname;
         __s32   optlen;
         __s32   retval;
};

set/getsockopt. If reading/writing to optval is done through a kfunc, this can 
be tracked. The kfunc can also directly read/write the user memory in optval, 
avoid the pre-alloc kernel memory and the PAGE_SIZE limit but this is a minor 
point.