On Mon, Feb 14, 2022 at 1:55 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
> On 2/12/2022 6:31 PM, Jim Mattson wrote:
> > On Fri, Feb 11, 2022 at 1:47 PM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>
> >> On 2/11/2022 1:08 PM, Jim Mattson wrote:
> >>> On Fri, Feb 11, 2022 at 6:11 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 2/10/2022 2:55 PM, David Dunn wrote:
> >>>>> Kan,
> >>>>>
> >>>>> On Thu, Feb 10, 2022 at 11:46 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>>> No, we don't, at least for Linux. Because the host owns everything. It
> >>>>>> doesn't need the MSR to tell which one is in use. We track it in an SW way.
> >>>>>>
> >>>>>> For the new request from the guest to own a counter, I guess maybe it is
> >>>>>> worth implementing it. But yes, the existing/legacy guest never checks
> >>>>>> the MSR.
> >>>>>
> >>>>> This is the expectation of all software that uses the PMU in every
> >>>>> guest. It isn't just the Linux perf system.
> >>>>>
> >>>>> The KVM vPMU model we have today results in the PMU-utilizing software
> >>>>> simply not working properly in a guest. The only case that can
> >>>>> consistently "work" today is not giving the guest a PMU at all.
> >>>>>
> >>>>> And that's why you are hearing requests to gift the entire PMU to the
> >>>>> guest while it is running. All existing PMU software knows about the
> >>>>> various constraints on exactly how each MSR must be used to get sane
> >>>>> data. And by gifting the entire PMU it allows that software to work
> >>>>> properly. But that has to be controlled by policy at host level such
> >>>>> that the owner of the host knows that they are not going to have PMU
> >>>>> visibility into guests that have control of PMU.
> >>>>>
> >>>>
> >>>> I think here is how a guest event works today with KVM and perf subsystem.
> >>>> - The guest creates an event A.
> >>>> - The guest kernel assigns a guest counter M to event A, and configures
> >>>>   the related MSRs of the guest counter M.
> >>>> - KVM intercepts the MSR access and creates a host event B. (The
> >>>>   host event B is based on the settings of the guest counter M. As I said,
> >>>>   at least for Linux, some SW config impacts the counter assignment. KVM
> >>>>   never knows it. Event B can only be a similar event to A.)
> >>>> - The Linux perf subsystem assigns a physical counter N to the host event
> >>>>   B according to event B's constraint. (N may not be the same as M,
> >>>>   because A and B may have different event constraints.)
> >>>>
> >>>> As you can see, even if the entire PMU is given to the guest, we still
> >>>> cannot guarantee that the physical counter M can be assigned to the
> >>>> guest event A.
> >>>
> >>> All we know about the guest is that it has programmed virtual counter
> >>> M. It seems obvious to me that we can satisfy that request by giving
> >>> it physical counter M. If, for whatever reason, we give it physical
> >>> counter N instead, and M and N are not completely fungible, then we
> >>> have failed.
> >>>
> >>>> How to fix it? The only thing I can imagine is "passthrough". Let KVM
> >>>> directly assign the counter M to guest. So, to me, this policy sounds
> >>>> like let KVM replace the perf to control the whole PMU resources, and we
> >>>> will hand them over to our guest then. Is it what we want?
> >>>
> >>> We want PMU virtualization to work. There are at least two ways of doing that:
> >>> 1) Cede the entire PMU to the guest while it's running.
> >>
> >> So the guest will take over the control of the entire PMUs while it's
> >> running. I know someone wants to do system-wide monitoring. This case
> >> will fail.
> >
> > We have system-wide monitoring for fleet efficiency, but since there's
> > nothing we can do about the efficiency of the guest (and those cycles
> > are paid for by the customer, anyway), I don't think our efficiency
> > experts lose any important insights if guest cycles are a blind spot.
>
> Others, e.g., NMI watchdog, also occupy a performance counter. I think
> the NMI watchdog is enabled by default at least for the current Linux
> kernel. You have to disable all such cases in the host when the guest is
> running.

It doesn't actually make any sense to run the NMI watchdog while in
the guest, does it?

> >
> >> I'm not sure whether you can fully trust the guest. If malware runs in
> >> the guest, I don't know whether it will harm the entire system. I'm not
> >> a security expert, but it sounds dangerous.
> >> Hope the administrators know what they are doing when choosing this policy.
> >
> > Virtual machines are inherently dangerous. :-)
> >
> > Despite our concerns about PMU side-channels, Intel is constantly
> > reminding us that no such attacks are yet known. We would probably
> > restrict some events to guests that occupy an entire socket, just to
> > be safe.
> >
> > Note that on the flip side, TDX and SEV are all about catering to
> > guests that don't trust the host. Those customers probably don't want
> > the host to be able to use the PMU to snoop on guest activity.
> >
> >>> 2) Introduce a new "ultimate" priority level in the host perf
> >>> subsystem. Only KVM can request events at the ultimate priority, and
> >>> these requests supersede any other requests.
> >>
> >> The "ultimate" priority level doesn't help in the above case. The
> >> counter M may not bring the precise which guest requests. I remember you
> >> called it "broken".
> >
> > Ideally, ultimate priority also comes with the ability to request
> > specific counters.
> >
> >> KVM can fail the case, but KVM cannot notify the guest. The guest still
> >> sees the wrong result.
> >>
> >>>
> >>> Other solutions are welcome.
> >>
> >> I don't have a perfect solution to achieve all your requirements. Based
> >> on my understanding, the guest has to be compromised by either
> >> tolerating some errors or dropping some features (e.g., some special
> >> events). With that, we may consider the above "ultimate" priority level
> >> policy. The default policy would be the same as the current
> >> implementation, where the host perf treats all the users, including the
> >> guest, equally. The administrators can set the "ultimate" priority level
> >> policy, which may let the KVM/guest pin/own some regular counters via
> >> perf subsystem. That's just my personal opinion for your reference.
> >
> > I disagree. The guest does not have to be compromised. For a proof of
> > concept, see VMware ESXi. Probably Microsoft Hyper-V as well, though I
> > haven't checked.
>
> As far as I know, VMware ESXi has its own VMkernel, which can own the
> entire HW PMUs. The KVM is part of the Linux kernel. The HW PMUs should
> be shared among components/users. I think the cases are different.

Architecturally, ESXi is not very different from Linux. The VMkernel
is a POSIX-compliant kernel, and VMware's "vmm" is comparable to KVM.

> Also, from what I searched on the VMware website, they still encounter
> the case that a guest VM may not get a performance monitoring counter.
> It looks like their solution is to let guest OS check the availability
> of the counter, which is similar to the solution I mentioned (Use
> GLOBAL_INUSE MSR).
>
> "If an ESXi host's BIOS uses a performance counter or if Fault Tolerance
> is enabled, some virtual performance counters might not be available for
> the virtual machine to use."

I'm perfectly happy to give up PMCs on Linux under those same conditions.
> https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-F920A3C7-3B42-4E78-8EA7-961E49AF479D.html
>
> "In general, if a physical CPU PMC is in use, the corresponding virtual
> CPU PMC is not functional and is unavailable for use by the guest. Guest
> OS software detects unavailable general purpose PMCs by checking for a
> non-zero event select MSR value when a virtual machine powers on."
>
> https://kb.vmware.com/s/article/2030221
>

Linux, at least, doesn't do that. Maybe Windows does.

> Thanks,
> Kan
>