On Mon, Jun 1, 2020 at 1:08 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>
> On Sun, May 31, 2020 at 02:03:48PM -0700, Andy Lutomirski wrote:
> > On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> > >
> > >
> > > What if there was a special filter type that ran a BPF program on each
> > > syscall, and the program was allowed to access user memory to make its
> > > decisions, e.g. to look at some list of memory addresses. But this
> > > would explicitly *not* be a security feature -- execve() would remove
> > > the filter, and the filter's outcome would be one of redirecting
> > > execution or allowing the syscall. If the "allow" outcome occurs,
> > > then regular seccomp filters run. Obviously the exact semantics here
> > > would need some care.
> >
> > Let me try to flesh this out a little.
> >
> > A task could install a syscall emulation filter (maybe using the
> > seccomp() syscall, maybe using something else). There would be at
> > most one such filter per process. Upon doing a syscall, the kernel
> > will first do initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
> > argument translation) and would then invoke the filter. The filter is
> > an eBPF program (sorry Kees) and, as input, it gets access to the
>
> FWIW, I agree: something like this needs to use eBPF -- this isn't
> being designed as a security boundary. It's more like eBPF ptrace.

On a bit more consideration, I think that I have the model a bit
wrong. We shouldn't think of this as a *syscall* filter but as a
filter for architectural privilege transitions in general. After all,
there is no particular guarantee that any given emulated program has a
syscall ABI that is even remotely compatible with Linux. So maybe the
filter is fed events like SYSCALL64, SYSCALL32, SYSENTER, #GP, #PF (the
bad kind that would otherwise get a signal), #UD, etc. And the filter
can examine process state and take some reasonable action.
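
To make the shape of this a little more concrete, here is a rough
userspace mock-up in plain C of the decision such a filter might make.
This is only a sketch under assumptions: every name in it (the xfilter_*
types and functions, the context layout, the handler address) is made up
for illustration and none of it exists in the kernel or in eBPF. It
assumes the filter sees the event type and the instruction pointer, and
that the "list of memory addresses" is a set of ranges holding emulated
code.

```c
/* Hypothetical mock-up: none of these names exist in the kernel. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Privilege-transition events the filter might be fed (invented names). */
enum xfilter_event {
	XF_SYSCALL64,
	XF_SYSCALL32,
	XF_SYSENTER,
	XF_GP,	/* #GP */
	XF_PF,	/* #PF that would otherwise raise a signal */
	XF_UD,	/* #UD */
};

/* The two outcomes: let the kernel proceed, or redirect execution. */
enum xfilter_verdict {
	XF_ALLOW,
	XF_REDIRECT,
};

/* Process state handed to the filter (invented layout). */
struct xfilter_ctx {
	enum xfilter_event event;
	uint64_t ip;		/* instruction pointer at the event */
	uint64_t redirect_ip;	/* written by the filter on XF_REDIRECT */
};

/* An address range holding emulated (non-Linux-ABI) code. */
struct xfilter_range {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
};

/* Return nonzero if ip lies inside one of the n ranges. */
static int xfilter_ip_is_emulated(const struct xfilter_range *r, size_t n,
				  uint64_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (ip >= r[i].start && ip < r[i].end)
			return 1;
	return 0;
}

/*
 * Decide what to do with one event. Events raised from emulated code are
 * redirected to a userspace handler; everything else is allowed, after
 * which regular seccomp filters would still run.
 */
static enum xfilter_verdict xfilter_run(struct xfilter_ctx *ctx,
					const struct xfilter_range *r,
					size_t n, uint64_t handler)
{
	assert(ctx != NULL);
	if (!xfilter_ip_is_emulated(r, n, ctx->ip))
		return XF_ALLOW;
	ctx->redirect_ip = handler;
	return XF_REDIRECT;
}
```

In this mock-up, an event whose ip falls inside a registered range comes
back as XF_REDIRECT with redirect_ip set to the handler, and any other
event comes back as XF_ALLOW with the context untouched. A real filter
would presumably want to look at more state than the ip, and the exact
semantics would still need the care mentioned above.
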
Think of it as a personality scheme that's programmable by user code.
I imagine that even schemes like NaCl could make some use of this.

This allows all kinds of interesting things. For example, it should
give Wine a much nicer emulation of Windows SEH and vectored signals.
And maybe it could finally allow Linux userspace to have some sensible
equivalent of those Windows features -- being able to write library
code that could sanely handle, say, math errors would be quite handy.

This could be mocked up with cBPF, but I think a cBPF version will
struggle to be a performant solution for Wine because it will have a
hard time distinguishing between Windows and Linux syscalls.

--Andy