On Tue, Jan 31, 2012 at 4:04 AM, Will Drewry <wad@xxxxxxxxxxxx> wrote: > On Mon, Jan 30, 2012 at 7:42 PM, Indan Zupancic <indan@xxxxxx> wrote: >> On Mon, January 30, 2012 23:26, Will Drewry wrote: >>> Do you think something along the lines of 2 kB is sane for a config-less change? >> >> Yes, especially if there is some way to get rid of it anyway, >> like disabling SECCOMP or some option under CONFIG_EMBEDDED. >> But it seems you need at least a hidden config option which >> depends on the stuff you need. > > Disabling SECCOMP would definitely do it. > >>> >>> Doing exactly that. I've been tinkering with the best way to minimize >>> the impact to the existing BPF evaluator. Right now, I'm adding a >>> very small number of new instructions to the sk_buff specific code >>> path, but I haven't yet benchmarked - just disasssembled. >> >> I would do all the checking in sk_chk_filter(), so you know that when >> you do run the filter, you can't hit the sk_buff paths. This doesn't >> cause any slow down for the networking path. > > Ah sorry - I was referring to the intrusion of a load_pointer function > pointer. I want to leave the current networking path as untouched as > possible. For checking, I agree -- a quick change to sk_chk_filter or > even just a small helper function that scans for any codes in the > ancillary range will do the trick. I see now. I can do all the fixup in sk_chk_filter. Clever! Sorry for not catching on faster :) > >>> I agree. I will post the next series with a proposed integration. If >>> there is a lot of resistance, then the difference will be going from a >>> 2kB changes to a 3kB change. >> >> Let's see how it goes. >> >>>> I think you should go on a quest to make sure (almost) all archs have that file, >>>> before this patch can be merged. At least the archs that have ptrace support. >>> >>> I'm an idiot. CONFIG_HAVE_ARCH_TRACEHOOK covers asm/syscall.h >>> >>> So I have two choices: >>> 1. allow seccomp with filtering on these platforms by fail if an >>> argument is accessed >>> 2. return ENOSYS when a filter is attempted to be installed on >>> platforms with no tracehook support. >> >> I vote for: >> >> 3. Add tracehook support to all archs. > > I don't see these #3 as mutually exclusive :) tracehook requires: > - task_pt_regs() in asm/processor.h or asm/ptrace.h > - arch_has_single_step() if there is hardware single-step support > - arch_has_block_step() if there is hardware block-step support > - asm/syscall.h supplying asm-generic/syscall.h interface > - linux/regset.h user_regset interfaces > - CORE_DUMP_USE_REGSET #define'd in linux/elf.h > -TIF_SYSCALL_TRACE calls tracehook_report_syscall_{entry,exit} > - TIF_NOTIFY_RESUME calls tracehook_notify_resume() > - signal delivery calls tracehook_signal_handler() > >> Maybe not all archs, but at least some more. That way, every time someone >> adds something tracehook specific, more archs support it. > > Well the other arch I want this on specifically for my purposes is > arm, but someone recently posted a partial asm/syscall.h for arm, but > I didn't see that one go anywhere just yet. (I know syscall_get_nr > can be tricky on arm because it doesn't (didn't) have a way of > tracking in-syscall state.) > > ref: https://lkml.org/lkml/2011/12/1/131 > >> syscall.h has no TRACEHOOK defines or anything though. > > Nope - it is just part of what is expected. > >> Only syscall_rollback() looks tricky. I have no clue what the difference >> between syscall_get_error() and syscall_get_return_value() is. But you >> only need to add syscall_get_nr() and syscall_[gs]et_arguments(), which >> should be possible for all archs. > > It seems even syscall_get_nr can have some wrinkles :) > > ref: http://lkml.indiana.edu/hypermail/linux/kernel/0906.3/00096.html > >> How many archs don't support tracehook? > > 14 out of 26. However, 5 of those still have asm/syscall.h > >>> I think #2 is the nicest user contract, but #1 allows working filters >>> even on less hospitable arches even if they can't user arguments >>> (yet). I'm coding it up as #2, but it is easy to switch to #1. >> >> If you don't support the arch, don't compile any code at all, and >> let prctl(2) return EINVAL. You don't want to return ENOSYS. > > I was thinking of the inline prctl handler, but EINVAL makes sense. > >>>> Yeah, I figured that out later on. It's quite nifty, but I find the recursion >>>> within kref_put() slightly worrisome. Perhaps the code would be cleaner if this >>>> was avoided and done differently, but I can't think of a good alternative. I'll >>>> wait for the new version to see if I can find a way. >>> >>> Thanks - sure. Since kref_put is just an atomic_dec_and_test followed >>> by a call to the function pointer, I wasn't too concerned. Without >>> changing how the relationships are handled, I'm not sure how to >>> approach it differently (and still avoid races). IIRC, this isn't much >>> different to how namespaces work, they just use their own atomic >>> counters. >> >> Well, the thing is, this recursion is controlled by user space depending >> on how many filters they have installed. What is preventing them to force >> you out of stack? > > Hrm true. The easiest option is to just convert it to iterative by > not using kref_t, but I'll look more closely. > >> So perhaps add at least some arbitrary filter limit to avoid this? > > Definitely possible -- probably as a sysctl. I'm not quite sure what > number makes sense yet, but I'll look at breaking the recursion first. > Thanks! > >>> I'll clarify a bit. My original ptrace integration worked such that a >>> tracer may only intercept syscall failures if it attached prior to the >>> failing filter being installed. I did it this way to avoid using >>> ptrace to escape system call filtering. However, since I don't have >>> that as part of the patch series, it doesn't make sense to keep it. (I >>> tracked a tracer pid_struct in the filters.) If it needs to come back >>> with later patchsets, then it can be tackled then! >> >> The problem of that is that filters can be shared between processes with >> different ptracers. And you have all the hassle of keeping it up to date. >> >> I think seccomp should always come first and just trust ptrace. This >> because it can deny all ptrace() calls for filtered tasks, so the only >> untrusted tasks doing ptrace() are outside of seccomp's filtering control. >> And those could do the same system calls themselves. >> >> The case where there is one task being filtered and allowed to do ptrace, >> but not some other syscall, ptracing another filtered task which isn't >> allowed to do ptrace, but allowed to do that other syscall, is quite far >> fetched I think. If you really want to handle this, then you could run >> the ptracer's filters against the tracee's post-ptrace syscall state. >> This is best done in the ptracer's context, just before continuing the >> system call. (You really want Oleg's SIKILL immediate patch then.) >> >> What about: >> >> 1) Seccomp filters can deny a syscall by killing the task. >> >> 2) Seccomp can deny a syscall and let it return an error value. >> >> I know you're not fond of this one. It's just a performance >> optimisation as sometimes a lot of denied but harmless syscalls >> are called by glibc all the time, like getxattr(). Hardcoding >> the kill policy seems wrong when it can be avoided. If this is >> too hard then skip it, but if it's easy to add then please do. >> I'll take a look at this with your next patch version. > > It's easy on x86 harder on other arches. I would suggest saving > changing the __secure_computing signature until after the core > functionality lands, but that's just me. > >> 3) Seccomp can allow a syscall to proceed normally. >> >> 4) Seccomp can set a hint to skip ptrace syscall events for this syscall. >> A filter sets this by returning a specific value. >> >> 5) Ptrace always gets a syscall event when it asked for it. >> >> 6) Ptrace can set an option to honour seccomp's hint and to not get all >> syscall events. >> >> This way all seccomp needs to do is to set some flags which ptrace can check. > > I like the use of flags/options to trigger ptrace handling. If I were > to stack rank these for pursuit after the core functionality lands, > it'd be to add #6 (and its deps) then #2. With #6, #2 can be > simulated (by having a supervisor that changes the syscall number to > -1), but that is much less ideal than just returning SECCOMP_ERROR > instead of SECCOMP_ALLOW/DENY and letting an error code get bubbled > up. > > thanks! > will -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html