On 10/27/22 1:46 PM, Andrii Nakryiko wrote:
On Wed, Oct 26, 2022 at 6:17 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
The cgroup-bpf {get,set}sockopt prog is useful to change the optname behavior.
The bpf prog usually just handles a few specific optnames and ignores most
others. For the optnames that it ignores, it usually does not need to change
the optlen. The exception is when optlen > PAGE_SIZE (or optval_end - optval).
The bpf prog needs to set the optlen to 0 for this case or else the kernel will
return -EFAULT to the userspace. It is usually not what the bpf prog wants
because the bpf prog only expects error returning to userspace when it has
explicitly 'return 0;' or used bpf_set_retval(). If a bpf prog always changes
optlen for optnames that it does not care to 0, it may risk if the latter bpf
prog in the same cgroup may want to change/look-at it.
Would like to explore if there is an easier way for the bpf prog to handle it.
eg. does it make sense to track if the bpf prog has changed the ctx->optlen
before returning -EFAULT to the user space when ctx.optlen > max_optlen?
Maybe optlen + optval/optval_end could be replaced with dynptr? If we
do a new type of dynptr (DYNPTR_CTXBUF or something like that), we can
implement tracking of whether it was ever modified through
bpf_dynptr_write() or if direct memory access was ever used (was
bpf_dynptr_data() called). Not sure how you'd know if
bpf_dynptr_data() was used to modify data, though (this is where
bpf_dynptr_data_rdonly() vs bpf_dynptr_data() would be helpful,
perhaps). But just a seed of an idea, maybe you guys can somehow fit
it here?
Yep, dynptr can be used here. May be one dynptr specifically for sockptr_t. The
dynptr will have the __user pointer first to avoid a big (and potentially bogus)
kernel buffer pre allocation. Then only read and write it through the
bpf_dynptr_read() and write(). Since it is __user pointer, it won't be directly
accessible through data slice with bpf_dynptr_data(). It should not be a perf
issue, most of the common optval is an integer. The worst common case is the 16
bytes tcp-cc name. The bpf prog has to be sleepable first though which I think
will be a useful thing to have.