On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote: > We've run into a problem where attaching a filter can be quite messy > business because the filter itself intercepts sendmsg, and other > syscalls related to exfiltrating the listener FD. I believe that this > problem set has been brought up before, and although there are > "simpler" methods of exfiltrating the listener, like clone3 or > pidfd_getfd, but these are still less than ideal. (You really like sending patches and discussion points in the middle of the merge window. :D I think everyone's panicked about getting their PRs in shape so it's not unlikely that this sometimes gets lost on the list. :)) It hasn't been a huge problem for us, especially since we added pidfd_getfd() this seemed like a straightforward problem to solve by selecting a fix fd number that is to be used for the listener. But I can see why it is annoying. > > One of the ideas that's been talked about (I want to say back at LSS > NA) is the idea of "delayed activation". I was thinking that it might > be nice to have a mechanism to do delayed attach, either activated on > execve / fork, or an ioctl on the listenerfd to activate the filter > and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which > indicates that the listener should be setup, but not enforcing, and > another ioctl to activate it. > > The later approach is preferred due to simplicity, but I can see a > situation where you could accidentally get into a state where the > filter is not being enforced. Additionally, this may have unforeseen > implications with CRIU. (If you were to expose an ioctl() that allows userspace to query the notifer state then CRIU shouldn't have a problem restoring the notifier in the correct state. Right now it doesn't do anyting fancy about the notifier, it just restores the task with the filter. It just has to learn about the new feature and that's fine imho.) > > I'm curious whether this is a problem others share, and whether any of > the aforementioned approaches seem reasonable. So when I originally suggested the delayed activation I I had another related idea that I think I might have mentioned too: if we're already considering delaying filter activation I like to discuss the possibility of attaching a seccomp filter to a task. Right now, if one task wants to attach to another task they need to recreate the whole seccomp filter and load it. That's not just pretty expensive but also only works if you have access to the rules that the filter was generated with. For container that's usually some sort of pseudo seccomp filter configuration language dumped into a config file from which it can be read. So right now the status quo is: struct sock_filter filter[] = { BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), /* Get me a listener fd */ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; int fd = seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog); and then the caller must send the fd to the manager or the manager uses pidfd_getfd(). But, why not get a bit crazy^wcreative; especially since seccomp() is already a multiplexer. We introduce a new seccomp flag: #define SECCOMP_FILTER_DETACHED and a new seccomp command: #define SECCOMP_ATTACH_FILTER And now we could do something like: pid_t pid = fork(); if (pid < 0) return; if (pid == 0) { // do stuff BARRIER_WAKE_SETUP_DONE; // do more unrelated stuff BARRIER_WAIT_SECCOMP_FILTER; execve(exec-something); } else { int fd_filter; struct sock_filter filter[] = { BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; struct sock_fprog prog = { .len = (unsigned short)ARRAY_SIZE(filter), .filter = filter, }; int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog); BARRIER_WAIT_SETUP_DONE; int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener)); BARRIER_WAKE_SECCOMP_FILTER; } And now you have attached a filter to another task. This would be super elegant for a container manager. The container manager could also stash the filter fd and when attaching to a container the manager can send the attaching task the fd and the attaching task can do: int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_filter)); too and would be attached to the same filter as the target task. And for the listener fd case a container manager could simply set SECCOMP_RET_USER_NOTIF as before struct sock_filter filter[] = { BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)), BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), }; and now fd_filter simply functions as the notifier fd after seccomp(SECCOMP_ATTACH_FILTER) that's basically the fancy version of my delayed notifier activiation idea. I'm sure there's nastiness to figure out but I would love to see something like this. Christian _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers