Re: Semantics of "-cpu host" (was Re: [Qemu-devel] [PATCH 2/2] Expose tsc deadline timer cpuid to guest)

Eduardo Habkost <ehabkost@xxxxxxxxxx> · Thu, 10 May 2012 11:12:55 -0300

On Thu, May 10, 2012 at 04:39:45PM +0300, Gleb Natapov wrote:
> On Thu, May 10, 2012 at 03:21:41PM +0200, Alexander Graf wrote:
> > On 05/10/2012 02:53 PM, Gleb Natapov wrote:
> > >On Wed, May 09, 2012 at 04:38:02PM -0300, Eduardo Habkost wrote:
> > >>On Wed, May 09, 2012 at 12:38:37PM +0300, Gleb Natapov wrote:
> > >>>On Wed, May 09, 2012 at 11:05:58AM +0200, Alexander Graf wrote:
> > >>>>On 09.05.2012, at 10:51, Gleb Natapov wrote:
> > >>>>
> > >>>>>On Wed, May 09, 2012 at 10:42:26AM +0200, Alexander Graf wrote:
> > >>>>>>
> > >>>>>>On 09.05.2012, at 10:14, Gleb Natapov<gleb@xxxxxxxxxx>  wrote:
> > >>>>>>
> > >>>>>>>On Wed, May 09, 2012 at 12:07:04AM +0200, Alexander Graf wrote:
> > >>>>>>>>On 08.05.2012, at 22:14, Eduardo Habkost wrote:
> > >>>>>>>>
> > >>>>>>>>>On Tue, May 08, 2012 at 02:58:11AM +0200, Alexander Graf wrote:
> > >>>>>>>>>>On 07.05.2012, at 20:21, Eduardo Habkost wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>>Andre? Are you able to help to answer the question below?
> > >>>>>>>>>>>
> > >>>>>>>>>>>I would like to clarify what's the expected behavior of "-cpu host" to
> > >>>>>>>>>>>be able to continue working on it. I believe the code will need to be
> > >>>>>>>>>>>fixed on either case, but first we need to figure out what are the
> > >>>>>>>>>>>expectations/requirements, to know _which_ changes will be needed.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>On Tue, Apr 24, 2012 at 02:19:25PM -0300, Eduardo Habkost wrote:
> > >>>>>>>>>>>>(CCing Andre Przywara, in case he can help to clarify what's the
> > >>>>>>>>>>>>expected meaning of "-cpu host")
> > >>>>>>>>>>>>
> > >>>>>>>>>>>[...]
> > >>>>>>>>>>>>I am not sure I understand what you are proposing. Let me explain the
> > >>>>>>>>>>>>use case I am thinking about:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>- Feature FOO is of type (A) (e.g. just a new instruction set that
> > >>>>>>>>>>>>doesn't require additional userspace support)
> > >>>>>>>>>>>>- User has a Qemu vesion that doesn't know anything about feature FOO
> > >>>>>>>>>>>>- User gets a new CPU that supports feature FOO
> > >>>>>>>>>>>>- User gets a new kernel that supports feature FOO (i.e. has FOO in
> > >>>>>>>>>>>>GET_SUPPORTED_CPUID)
> > >>>>>>>>>>>>- User does _not_ upgrade Qemu.
> > >>>>>>>>>>>>- User expects to get feature FOO enabled if using "-cpu host", without
> > >>>>>>>>>>>>upgrading Qemu.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>The problem here is: to support the above use-case, userspace need a
> > >>>>>>>>>>>>probing mechanism that can differentiate _new_ (previously unknown)
> > >>>>>>>>>>>>features that are in group (A) (safe to blindly enable) from features
> > >>>>>>>>>>>>that are in group (B) (that can't be enabled without an userspace
> > >>>>>>>>>>>>upgrade).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>In short, it becomes a problem if we consider the following case:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>- Feature BAR is of type (B) (it can't be enabled without extra
> > >>>>>>>>>>>>userspace support)
> > >>>>>>>>>>>>- User has a Qemu version that doesn't know anything about feature BAR
> > >>>>>>>>>>>>- User gets a new CPU that supports feature BAR
> > >>>>>>>>>>>>- User gets a new kernel that supports feature BAR (i.e. has BAR in
> > >>>>>>>>>>>>GET_SUPPORTED_CPUID)
> > >>>>>>>>>>>>- User does _not_ upgrade Qemu.
> > >>>>>>>>>>>>- User simply shouldn't get feature BAR enabled, even if using "-cpu
> > >>>>>>>>>>>>host", otherwise Qemu would break.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>If userspace always limited itself to features it knows about, it would
> > >>>>>>>>>>>>be really easy to implement the feature without any new probing
> > >>>>>>>>>>>>mechanism from the kernel. But that's not how I think users expect "-cpu
> > >>>>>>>>>>>>host" to work. Maybe I am wrong, I don't know. I am CCing Andre, who
> > >>>>>>>>>>>>introduced the "-cpu host" feature, in case he can explain what's the
> > >>>>>>>>>>>>expected semantics on the cases above.
> > >>>>>>>>>>Can you think of any feature that'd go into category B?
> > >>>>>>>>>- TSC-deadline: can't be enabled unless userspace takes care to enable
> > >>>>>>>>>the in-kernel irqchip.
> > >>>>>>>>The kernel can check if in-kernel irqchip has it enabled and otherwise mask it out, no?
> > >>>>>>>>
> > >>>>>>>How kernel should know that userspace does not emulate it?
> > >>>>>>You have to enable the in-kernel apic to use it, at which point the kernel knows it's in use, right?
> > >>>>>>
> > >>>>>>>>>- x2apic: ditto.
> > >>>>>>>>Same here. For user space irqchip the kernel side doesn't care. If in-kernel APIC is enabled, check for its capabilities.
> > >>>>>>>>
> > >>>>>>>Same here.
> > >>>>>>>
> > >>>>>>>Well, technically both of those features can't be implemented in
> > >>>>>>>userspace right now since MSRs are terminated in the kernel, but I
> > >>>>>>Doesn't sound like the greatest design - unless you deprecate the non-in-kernel apic case.
> > >>>>>>
> > >>>>>You mean terminating MSRs in kernel does not sound like the greatest
> > >>>>>design? I do not disagree. That is why IMO kernel can't filter out
> > >>>>>TSC-deadline and x2apic like you suggest.
> > >>>>I still don't see why it can't.
> > >>>>
> > >>>>Imagine we would filter TSC-deadline and x2apic by default in the kernel - they are not known to exist yet.
> > >>>>Now, we implement TSC-deadline in the kernel. We still filter
> > >>>>TSC-deadline out in GET_SUPORTED_CPUID in the kernel. But we provide
> > >>>>an interface to user space that says "call me to enable TSC-deadline
> > >>>>CPUID, but only if you're using the in-kernel apic"
> > >>We have that interface already, it is called KVM_SET_CPUID.  :-)
> > >>
> > >>>>New user space calls that ioctl when it's using the in-kernel apic, it doesn't when it's using the user space apic.
> > >>>>Old user space doesn't call that ioctl.
> > >>>First of all we already have TSC-deadline in GET_SUPORTED_CPUID without
> > >>>any additional ioctls. And second I do not see why we need additional
> > >>>iostls here.
> > >>We don't have TSC-deadline set today (and that's what started this
> > >>thread), but we have x2apic. So what you say is true for x2apic, anyway.
> > >>
> > >Hmm, strange. But s/TSC-deadline/x2apic/ and the point stands :)
> > >
> > >>>Hmm, so may be I misunderstood you. You propose to mask TSC-deadline and
> > >>>x2apic out from GET_SUPORTED_CPUID if irq chip is not in kernel, not
> > >>>from KVM_SET_CPUID? For those two features it may make sense indeed.
> > >>It makes sense to me.
> > >>
> > >>It looks like my assumptions were wrong. They were:
> > >>
> > >>- GET_SUPPORTED_CPUID simply can't know if the in-kernel irqchip is
> > >>   going to be enabled or not.
> > >I think this is currently true. Before we changing that we have to make
> > >sure that no existing userspace calls GET_SUPPORTED_CPUID before
> > >creating irqchip. Not sure we can check all existing userspaces.
> > >
> > >>- GET_SUPPORTED_CPUID output has to be a function of the kernel code
> > >>   capabilitie and host CPU, and not depend on any input from userspace.
> > >>
> > >>Are those assumptions incorrect? If we break them, we may try what
> > >>Alexander is proposing. It would be much more flexible than the options
> > >>I was considering.
> > >>
> > >>I didn't know ENABLE_CAP existed. Even if GET_SUPPORTED_CPUID can't
> > >What is ENABLE_CAP?
> > 
> > It's an ioctl that you pass in a CAP. It's used on non-x86 already
> > to easily turn on features inside kvm :).
> > 
> No wonder grep in arch/x86 told me nothing :) So for x86 this ioctl as
> good as non-existent for now.
> 
> > >
> > >>check for the in-kernel irqchip setup for some reason, ENABLE_CAP could
> > >>be used by userpace to tell the kernel "I will enable the in-kernel
> > >>irqchip, so feel free to return features that depend on it on
> > >>GET_SUPPORTED_CPUID".
> > >>
> > >>In other words, we would return only the type-A features on
> > >>GET_SUPPORTED_CPUID (i.e. safe to be blindly enabled by -cpu host as
> > >>long as migration is not required), but if we use ENABLE_CAP we can make
> > >>group A safely grow, as long as userspace first tells the kernel what it
> > >>supports.
> > >>
> > >>Anybody is against doing that? Otherwise I plan to work on this.
> > >>Probably I will start by making GET_SUPPORTED_CPUID not return x2apic
> > >>unless userspace tells the kernel (using ENABLE_CAP) that it will enable
> > >>the in-kernel irqchip. Then we can do the same with TSC-deadline.
> > >>
> > >>Note that all this work is to allow the kernel to let userspace blindly
> > >>enable features it _doesn't know yet_. If we limit ourselves to features
> > >>userspace already knows about, we could simply remove x2apic and
> > >>TSC-deadline from GET_SUPPORTED_CPUID completely, not use ENABLE_CAP for
> > >>that, and let userspace set x2apic or TSC-deadline on KVM_SET_CPUID only
> > >>if it knows it is safe (either because it checked for the corresponding
> > >>KVM_CAP_* capability is present and it will enable the in-kernel
> > >>irqchip, or because it will emulated it in userspace).
> > >>
> > >There is not KVM_CAP_* for x2apic as far as I see.
> > >
> > >>In case anybody is against the proposal above: note that the current
> > >>documented GET_SUPPORTED_CPUID semantics (unconditionally returning bits
> > >>that depend on specific userspace behavior/capabilities) is simply
> > >>unusable by "-cpu host". If the above proposal gets rejected, my Plan B
> > >>is to update the GET_SUPPORTED_CPUID documentation to note that it
> > >>returns only type-A features (features that userspace can safely enable
> > >>even if it doesn't know what it does), remove x2apic from
> > >>GET_SUPPORTED_CPUID forever, and use KVM_CAP_* for discovery of all
> > >>type-B features (features that depend on specific userspace
> > >>capabilities/behavior).
> > >>
> > >This will break existing userspaces, no?
> > 
> > It would mean that new user space on old kernels don't get x2apic or
> > tsc-deadline in cpuid, yes.
> > 
> tsc-deadline actually exposed via capability, but disappearance of
> x2apic would be serious regression.

To keep compatibility, we can let userspace tell the kernel that they
want the "safe" version of GET_SUPPORTED_CPUID (using SET_CAP?). If
userspace doesn't do that, x2apic would be kept enabled on
GET_SUPPORTED_CPUID for compatiblity.

The problem here is that the current behavior of GET_SUPPORTED_CPUID
(returning x2apic) is unusable by -cpu host, unless we make the
in-kernel irqchip mandatory for -cpu host (but that would be solving
just our immediate needs, and not any similar cases in the future).

> > >
> > >>>Not
> > >>>sure there won't be others that are not dependent on irq chip presence.
> > >>>You propose to add additional ioctls to enable them if they appear?
> > >>I am sure there will be new features in the future that don't depend on
> > >>any userspace support, so they would be enabled on GET_SUPPORTED_CPUID
> > >>unconditionally.
> > >>
> > >Those not the kind of features I am worry about though. I worry about
> > >the features that can be emulated either by kernel or by userspace and
> > >userspace may choose where it wants the feature to be emulated.
> > 
> > Like? Usually CPUID features reflect new instructions. We don't
> > notify user space on unknown instructions, right?
> > 
> No we do not. I can't give you concrete example, but think about some
> apic unrelated device that is accessible via MSR and can be emulated
> either in kernel or in userspace. Hmm, maybe something like ACPI
> (Thermal Monitor and Software Controlled Clock Facilities), or PSN 
> (Processor Serial Number).

Those cases are the easy ones (or maybe they're just the ones I don't
worry about, right now :).

- If the kernel can't emulate it, the feature would never appear on
  GET_SUPPORTED_CPUID, but userspace could enable it using SET_CPUID if
  userspace is going to emulate it.
  - In case userspace emulation don't depend on any extra kernel
    capability, userspace would simply set it on SET_CPUID.
    - Now, what happens here if a future kernel version becomes able to
      emulate it in-kernel?
  - In case userspace emulation depend on some kernel capability,
    userspace can use CHECK_EXTENSION before setting it on SET_CPUID.
- If the kernel can emulate it without depending on any userspace
  capability/behavior, the feature would appear on GET_SUPPORTED_CPUID.
  - If userspace wants to use the kernel emulation, it would just enable
    it on SET_CPUID.
  - If userspace wants to emulate it itself, it would set it on
    SET_CPUID, but it would need to ask the kernel to disable the
    in-kernel emulation (how?).

I am more worried about this case:

- If the kernel can emulate it, but depends on specific userspace
  capability/behavior to work. That feature can't appear on
  GET_SUPPORTED_CPUID by default, otherwise older qemu versions using
  "-cpu host" would try to enable it without doing what's necessary to
  make it work. This is the case for x2apic and TSC-deadline.

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html