On Wed, Aug 12, 2020 at 12:25:43PM +0200, Paolo Bonzini wrote: > On 12/08/20 07:07, Like Xu wrote: > > To emulate PMC counter for guest, KVM would create an > > event on the host with 'exclude_guest=0, exclude_hv=0' > > which simply makes no sense and is utterly broken. > > > > To keep perf semantics consistent, any event created by > > pmc_reprogram_counter() should both set exclude_hv and > > exclude_host in the KVM context. > > > > Message-ID: <20200811084548.GW3982@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> > > Signed-off-by: Like Xu <like.xu@xxxxxxxxxxxxxxx> > > --- > > arch/x86/kvm/pmu.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c > > index 67741d2a0308..6a30763a10d7 100644 > > --- a/arch/x86/kvm/pmu.c > > +++ b/arch/x86/kvm/pmu.c > > @@ -108,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, > > .exclude_host = 1, > > .exclude_user = exclude_user, > > .exclude_kernel = exclude_kernel, > > + .exclude_hv = 1, > > .config = config, > > }; > > > > > > x86 does not have a hypervisor privilege level, so it never uses Arguably it does when Xen, but I don't think we support that, so *phew*. > exclude_hv; exclude_host already excludes all root mode activity for > both ring0 and ring3. Right, but we want to tighten the permission checks and not excluding_hv is just sloppy. That said; the exclude_host / exclude_guest thing is a giant trainwreck and I'm really not sure what to do about it. The thing is, we very much do not want to allow unpriv user to be able to create: exclude_host=1, exclude_guest=0 counters (they currently can). So we really want to add: if ((!exclude_host || !exclude_guest || !exclude_hv) && !perf_allow_kernel()) return -EACCESS; But the problem is, they were added late, so lots of userspace will not be setting those fields (or might not have even known about them), so we got to somehow deal with: exclude_host == exclude_guest == 0 And because of that, we now also have genius code like (intel_set_masks, amd_core_hw_config has even 'funnier' code): if (event->attr.exclude_host) __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask); if (event->attr.exclude_guest) __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask); Which is just confusing and bad, that really should've been: if (!event->attr.exclude_host) __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask); if (!event->attr.exclude_guest) __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask); :-( Also, exclude_host is really poorly defined: https://lkml.kernel.org/r/20200806091827.GY2674@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx "Suppose we have nested virt: L0-hv | G0/L1-hv | G1 And we're running in G0, then: - 'exclude_hv' would exclude L0 events - 'exclude_host' would ... exclude L1-hv events? - 'exclude_guest' would ... exclude G1 events? Then the next question is, if G0 is a host, does the L1-hv run in G0 userspace or G0 kernel space? I was assuming G0 userspace would not include anything L1 (kvm is a kernel module after all), but what do I know." The way it is implemented, you basically have to always set exclude_host=0, even if there is no virt at all and you want to measure your own userspace thing -- which is just weird. Meanwhile ARM64 couldn't quite figure out what it was all supposed to be either and also implemented something -- and i've not tried to understand what exactly, but hopefully compatible enough that we're not in an even worse corner. So on the one hand we're now leaking all sorts due to lack of permission checks, on the other hand we can't fix it because we're not allowed to break userspace :-( I suppose the 'best' option at this point is something like: /* * comment that explains the trainwreck. */ if (!exclude_host && !exclude_guest) exclude_guest = 1; if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel()) return -EPERM; But that takes away the possibility of actually having: 'exclude_host=0, exclude_guest=0' to create an event that measures both, which also sucks.