On Tue, Apr 25, 2023 at 2:28 PM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote: > > > > On 4/25/23 11:42, Stanislav Fomichev wrote: > > On Tue, Apr 25, 2023 at 10:59 AM Kui-Feng Lee <sinquersw@xxxxxxxxx> wrote: > >> > >> > >> > >> On 4/18/23 09:47, Stanislav Fomichev wrote: > >>> On 04/17, Martin KaFai Lau wrote: > >>>> On 4/14/23 6:55 PM, Stanislav Fomichev wrote: > >>>>> On 04/13, Stanislav Fomichev wrote: > >>>>>> On Thu, Apr 13, 2023 at 7:38 AM Aleksandr Mikhalitsyn > >>>>>> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote: > >>>>>>> > >>>>>>> On Thu, Apr 13, 2023 at 4:22 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>> On Thu, Apr 13, 2023 at 3:35 PM Alexander Mikhalitsyn > >>>>>>>> <aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote: > >>>>>>>>> > >>>>>>>>> During work on SO_PEERPIDFD, it was discovered (thanks to Christian), > >>>>>>>>> that bpf cgroup hook can cause FD leaks when used with sockopts which > >>>>>>>>> install FDs into the process fdtable. > >>>>>>>>> > >>>>>>>>> After some offlist discussion it was proposed to add a blacklist of > >>>>>>>> > >>>>>>>> We try to replace this word by either denylist or blocklist, even in changelogs. > >>>>>>> > >>>>>>> Hi Eric, > >>>>>>> > >>>>>>> Oh, I'm sorry about that. :( Sure. > >>>>>>> > >>>>>>>> > >>>>>>>>> socket options those can cause troubles when BPF cgroup hook is enabled. > >>>>>>>>> > >>>>>>>> > >>>>>>>> Can we find the appropriate Fixes: tag to help stable teams ? > >>>>>>> > >>>>>>> Sure, I will add next time. > >>>>>>> > >>>>>>> Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") > >>>>>>> > >>>>>>> I think it's better to add Stanislav Fomichev to CC. > >>>>>> > >>>>>> Can we use 'struct proto' bpf_bypass_getsockopt instead? We already > >>>>>> use it for tcp zerocopy, I'm assuming it should work in this case as > >>>>>> well? > >>>>> > >>>>> Jakub reminded me of the other things I wanted to ask here bug forgot: > >>>>> > >>>>> - setsockopt is probably not needed, right? setsockopt hook triggers > >>>>> before the kernel and shouldn't leak anything > >>>>> - for getsockopt, instead of bypassing bpf completely, should we instead > >>>>> ignore the error from the bpf program? that would still preserve > >>>>> the observability aspect > >>>> > >>>> stealing this thread to discuss the optlen issue which may make sense to > >>>> bypass also. > >>>> > >>>> There has been issue with optlen. Other than this older post related to > >>>> optlen > PAGE_SIZE: > >>>> https://lore.kernel.org/bpf/5c8b7d59-1f28-2284-f7b9-49d946f2e982@xxxxxxxxx/, > >>>> the recent one related to optlen that we have seen is > >>>> NETLINK_LIST_MEMBERSHIPS. The userspace passed in optlen == 0 and the kernel > >>>> put the expected optlen (> 0) and 'return 0;' to userspace. The userspace > >>>> intention is to learn the expected optlen. This makes 'ctx.optlen > > >>>> max_optlen' and __cgroup_bpf_run_filter_getsockopt() ends up returning > >>>> -EFAULT to the userspace even the bpf prog has not changed anything. > >>> > >>> (ignoring -EFAULT issue) this seems like it needs to be > >>> > >>> if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { > >>> /* error */ > >>> } > >>> > >>> ? > >>> > >>>> Does it make sense to also bypass the bpf prog when 'ctx.optlen > > >>>> max_optlen' for now (and this can use a separate patch which as usual > >>>> requires a bpf selftests)? > >>> > >>> Yeah, makes sense. Replacing this -EFAULT with WARN_ON_ONCE or something > >>> seems like the way to go. It caused too much trouble already :-( > >>> > >>> Should I prepare a patch or do you want to take a stab at it? > >>> > >>>> In the future, does it make sense to have a specific cgroup-bpf-prog (a > >>>> specific attach type?) that only uses bpf_dynptr kfunc to access the optval > >>>> such that it can enforce read-only for some optname and potentially also > >>>> track if bpf-prog has written a new optval? The bpf-prog can only return 1 > >>>> (OK) and only allows using bpf_set_retval() instead. Likely there is still > >>>> holes but could be a seed of thought to continue polishing the idea. > >>> > >>> Ack, let's think about it. > >>> > >>> Maybe we should re-evaluate 'getsockopt-happens-after-the-kernel' idea > >>> as well? If we can have a sleepable hook that can copy_from_user/copy_to_user, > >>> and we have a mostly working bpf_getsockopt (after your refactoring), > >>> I don't see why we need to continue the current scheme of triggering > >>> after the kernel? > >> > >> Since a sleepable hook would cause some restrictions, perhaps, we could > >> introduce something like the promise pattern. In our case here, BPF > >> program call an async version of copy_from_user()/copy_to_user() to > >> return a promise. > > > > Having a promise might work. This is essentially what we already do > > with sockets/etc with acquire/release pattern. > > Would you mind to give me some context of the socket things? I'm mostly referring to the infra around KF_ACQUIRE/KF_RELEASE where the verifier already has some resource tracking functionality. We can probably extend it to verify that the program does copy_to_user equivalent at the end of the run (or somehow specifically marks that it's not needed). > > > > What are the sleepable restrictions you're hinting about? I feel like > > with the sleepable bpf, we can also remove all the temporary buffer > > management / extra copies which sounds like a win to me. (we have this > > ugly heuristics with BPF_SOCKOPT_KERN_BUF_SIZE) The program can > > allocate temporary buffers if needed.. > > > >>>>> - or maybe we can even have a per-proto bpf_getsockopt_cleanup call that > >>>>> gets called whenever bpf returns an error to make sure protocols have > >>>>> a chance to handle that condition (and free the fd) > >>>>> > >>>> > >>>>