On Fri, Oct 3, 2014 at 1:42 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > On Fri, Oct 03, 2014 at 01:22:22PM -0700, Andy Lutomirski wrote: >> > So the problem with the default deny is that its: >> > 1) pointless -- the attacker can do sys_perf_event_open() just fine; >> >> Not if the attacker is in a seccomp sandbox. > > Clearly :-) > >> > 2) and expensive -- the people trying to measure performance get the >> > penalty of the CR4 write. >> >> Does this matter for performance measuring? I'm not 100% clear on how all >> the perf_event stuff gets used in practice, but, by my very vague >> understanding, there are two main workflows: >> >> a) perf record, etc: one process creates a ringbuffer and wakes up >> rarely to record the contents. The process being recorded doesn't >> have a perf_event mapped, so the cr4 switch will only happen when >> waking up the perf process. >> >> perf record prints stuff like "[ perf record: Woken up 1 times to >> write data ]", which seems to confirm my understanding. >> >> b) self-monitoring. A task mmaps a perf_event, does rdpmc, does >> something, and does rdpmc again. In that case, there's no context >> switch. > > Agreed, but with the default disable, the self-monitoring task will get > CR4 writes to its context switches and will be slower. > > Whereas if we default on and only disable with seccomp then only the > seccomp tasks will suffer the burden of a CR4 write at context switch > time. > > And I don't see a security implication in the default. I don't see a security implication, but we can fight over whether seccomp performance or perf self monitoring performance is more important. I spent a bunch of time making seccomp *much* faster for 3.18, and things like Chromium (the browser) could have lots of context switched into and out of seccomp. :) I suspect that the overhead only really matters when running as a VM guest. Hmm. Can we switch lazily? I don't really like the idea, but non-seccomp, non-perf-event tasks could, in principle, just inherit the PCE value from whatever had the CPU last. > >> > So I would suggest a default on, but allow a disable for the seccomp >> > users, which might have also disabled the syscall. Note that is is >> > possible to disable RDPMC while still allowing the syscall. >> >> Disabling RDPMC per-process while still allowing the syscall will need >> a bunch of work, right? > > Not much, all you need to do is make > perf_event_mmmap_page::cap_user_rdpmc be 0 and userspace should never > use rdpmc. Currently we set that bit based on the global sysfs rdpmc > value, but we could easy bitwise AND it with a per process value. > >> What happens if the same perf_event is mapped >> by two different users? > > Not entirely clear on what you mean here. Can you open a perf_event fd, fork, and mmap it in the child? What if you map it in the parent and the child? Then cap_user_rdpmc has to match for both mappings. Or is this impossible? > >> We could make the rule be that RDPMC is enabled if a perf event is >> mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that >> there's an actual performance issue first. Ideally we can get this >> all working with no API or ABI change at all. > > Well I just want to not have extra CR4 writes by default (and by default > I don't care about seccomp). > >> P.S. Hey, Intel, let us context switch RDPMC accessibility of the >> individual counters, please :) > > Ha sure, make it more complicated why don't you ;-) But yes, that has > come up before. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html