On 8/17/23 1:41 PM, Martin KaFai Lau wrote:
On 8/16/23 6:25 PM, Alexei Starovoitov wrote:
But I think we have to step back. Why do we need this whole thing in
the first place?
_why_ sockopt bpf progs needs to read and write user memory?
Yes there is one page limit, but what is the use case to actually read
and write
beyond that? iptables sockopt was mentioned, but I don't think bpf
prog can do
anything useful with iptables binary blobs. They are hard enough for
kernel to parse.
Usually the bpf prog is only interested in a very small number of
optnames and no need to read the optval at all for most cases. The max
size for our use cases is 16 bytes. The kernel currently is kind of
doing it the opposite and always assumes the bpf prog needing to use the
optval, allocate kernel memory and copy_from_user such that the
non-sleepable bpf program can read/write it.
The bpf prog usually checks the optname and then just returns for most
cases:
SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
if (ctx->optname != MY_INTERNAL_SOCKOPT)
return 1;
/* change the ctx->optval and return to user space ... */
}
When the optlen is > PAGE_SIZE, the kernel only allocates PAGE_SIZE
memory and copy the first PAGE_SIZE data from the user optval. We used
to ask the bpf prog to explicitly set the optlen to 0 for > PAGE_SIZE
case even it has not looked at the optval. Otherwise, the kernel used to
conclude that the bpf prog had set an invalid optlen because optlen is
larger than the optval_end - optval and returned -EFAULT incorrectly to
the end-user.
The bpf prog started doing this > PAGE_SIZE check and set optlen = 0 due
to an internal kernel PAGE_SIZE limitation:
SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
if (ctx->optname != MY_INTERNAL_SOCKOPT) {
/* only do that for ctx->optlen > PAGE_SIZE.
* otherwise, the following cgroup bpf prog will
* not be able to use the optval that it may
* be interested.
*/
if (ctx->optlen > PAGE_SIZE)
ctx->optlen = 0;
return 1;
}
/* change the ctx->optval and return to user space ... */
}
The above has been worked around in commit 29ebbba7d461 ("bpf: Don't
EFAULT for {g,s}setsockopt with wrong optlen").
Later, it was reported that an optname (NETLINK_LIST_MEMBERSHIPS) that
the kernel allows a user passing NULL optval and using the optlen
returned by getsockopt to learn the buffer space required. The bpf prog
then needs to remove the optlen > PAGE_SIZE check and set optlen to 0
for _all_ optnames that it is not interested while risking the following
cgroup prog may not be able to use some of the optval:
SEC("cgroup/getsockopt")
int get_internal_sockopt(struct bpf_sockopt *ctx)
{
if (ctx->optname != MY_INTERNAL_SOCKOPT) {
/* Do that for all optname that you are not interested.
* The latter cgroup bpf will not be able to use the optval.
*/
ctx->optlen = 0;
return 1;
}
/* chage the ctx->optval and return to user space ... */
}
The above case has been addressed in commit 00e74ae08638 ("bpf: Don't
EFAULT for getsockopt with optval=NULL").
To avoid other potential optname cases that may run into similar
situation and requires the bpf prog work around it again like the above,
it needs a way to track whether a bpf prog has changed the optval in
runtime. If it is not changed, use the result from the kernel
Can we add a field in bpf_sockopt uapi struct so bpf_prog can set it
if optval is changed?
struct bpf_sockopt {
__bpf_md_ptr(struct bpf_sock *, sk);
__bpf_md_ptr(void *, optval);
__bpf_md_ptr(void *, optval_end);
__s32 level;
__s32 optname;
__s32 optlen;
__s32 retval;
};
set/getsockopt. If reading/writing to optval is done through a kfunc,
this can be tracked. The kfunc can also directly read/write the user
memory in optval, avoid the pre-alloc kernel memory and the PAGE_SIZE
limit but this is a minor point.