Re: [RFC PATCH 1/6] kvm: add device control API

Scott Wood <scottwood@xxxxxxxxxxxxx> · Mon, 25 Feb 2013 20:38:23 -0600

On 02/24/2013 09:46:17 AM, Gleb Natapov wrote:
On Thu, Feb 21, 2013 at 08:17:54PM -0600, Scott Wood wrote:
>  On 02/21/2013 02:22:09 AM, Gleb Natapov wrote:
> >On Wed, Feb 20, 2013 at 08:05:12PM -0600, Scott Wood wrote:
> >> You mean like what we did with SREGS, that got deprecated and
> >> replaced with ONE_REG?
> >>
> >SREGS is implemented by ppc. I see ONE_REG as addition to REGS
> >interface. You can access all register at once or you can access  
them
> >one by one. If there is a need we can add MULTIPLE_REGS that will  
get
> >list of requested REGS.
>
> http://www.spinics.net/lists/kvm-ppc/msg04876.html
> http://www.spinics.net/lists/kvm-ppc/msg05842.html
>
If Alex prefers ONE_REG interface this is his call :)

My point is the reasons we prefer it, that SREGS really is deprecated  
for
new state on PPC (REGS is not extensible so it's been in that state for
even longer -- hence SREGS), and the desire to not create that same mess
again.  Running out of space is just one of the ways in which SREGS is a
mess, and I think trying to represent MPIC state that way would be even
worse.

> >The interface is not over generic i.e it does
> >not try to replace KVM_RUN by writing special register.
>
> Sigh.
>
> >>                                                    And we
> >> don't necessarily want to set *everything*.
> >What are those cases? You do need to on reset/migration.
>
> Why do we want to set all the registers on reset, rather than tell
> the in-kernel device to reset?  The default state came from the
> kernel in the first place on irqchip creation...
>
I have nothing against telling in-kernel device to reset provided  
there
is a way to do so, which current interface lacks.

If by "current" you mean what is proposed here, you can use
KVM_MPIC_DEV_MPIC_REGISTER to set GCR[RESET].

I can see the appeal of an explicit "reset device" command, though,
especially if the hardware reset bit is defined such that you're  
supposed
to poll until it's cleared to know that the reest is complete (the KVM
implementation is not likely to return before the reset is complete, but
it would be better to not rely on such knowledge).

Reset in userspase has
its advantage too: bugs are easier to fix,

They're harder to fix if fixing it requires adding something to the API.
Bugs in reset are just a small portion of overall bugs, so I'm not sure
how important this is.  It would require that we implement additional
state accessors now, rather than if and when we do migration.  It could
cause the total bug count to go up versus code that already works and is
self-contained.  :-P

there may be different kind of resets (hard/soft).

This can be communicated through a device-specific command if it is
applicable (it isn't for MPIC).

> Again, we do not currently support migration on MPIC.  It is a very
> low priority for embedded.  We do not wish to rule it out entirely,
> but it most likely would require adding more state accesors.
The interface suppose to be generic,
we are not talking about MPIC
specifically here.

There are two separate things here.  There is the device control api,
which is generic, but is just a way to create devices and issue commands
to them.

Then there is the specific device implementation, which is not generic.
The details of how migration is handled for a particular device belong  
to
the specific device implementation.

Regarding migration "never say never"  :)

I did say we don't want to rule it out -- I'm just emphasising this so
that we don't get bogged down in questions of how the currently defined
MPIC attributes/commands would get used for migration.  If we tried to
blindly add support for that now, without being able to test, or being
able to refer to what QEMU normally saves/restores for MPIC, we'd
probably get it wrong.  Just as the untested savevm code in QEMU's
hw/openpic.c is wrong/incomplete.

> As for other types of devices, x86 has i8254.c emulated in-kernel --
> I know that's not going to switch to the new interface, but it could
> have if it existed back then.
Since it is not going to switch it is not a good example.

The point of the example is to show that irqchips aren't the only thing
that could ever make sense to emulate in the kernel.  I can't see into
the future and tell you what the actual first non-irqchip user would be,
or if there will definitely be one.

If we wait until that user comes along, we'd end up saying *that* is
the only user and "since MPIC is not going to switch it is not a good
example".  If no non-irqchip user ends up coming along, what is the
actual harm from not putting the words "irqchip" in the interface
description (the generic part, not the device-specific part)?

>                                I can also see creating an in-kernel
> emulation device for doing MMIO filtering on some piece of embedded
> hardware that guests need to access with reasonable performance, but
> the hardware desginers screwed up the protection slightly (e.g. put
> other things in the same 4K page).  We've done such filtering before
> in our standalone hypervisor; the question is whether it happens to
> anything with enough performance requirements to be done in the
> kernel.
I am not sure why special device is needed for such filtering. If MMIO
is not handled by the kernel it is forwarded to a userspace.

"with reasonable performance"

> For KVM_IRQ_LINE, we could perhaps use the device id as the irqchip
> id in kvm_irq_routing_irqchip (or, have an attribute/command that
> retrieves the id to be used there).  Unfortunately there is no
> irqchip field in kvm_irq_routing_msi, though since it's basically a
> command to write to an arbitrary MMIO address, maybe it could just
> be implemented that way?
>
How MSI is delivered with MPIC? From device point of view sending an  
MSI
is doing write at address X of data Y. How PPC with MPIC translates  
this
into actual interrupt?

Alex described pretty much what happens.

I see that currently MSIs are just hardcoded to go to
kvm_irq_delivery_to_apic().  This could become a function pointer to
whatever irqchip has registered itself as handling MSIs for a particular
VM.  If we ever have more than one irqchip at the same time that could
handle an MSI, then we'd need something more complicated, but hopefully
we never see that.  For irqfd MSIs I'd rather avoid the overhead of  
going
through the full MMIO emulation path.

> >As long as you use the same attribute for migration and interrupt
> >injection
> >purpose I do. If you use separate attributes for migration and
> >interrupt
> >injection then not having separate ioctl is just a hack.
>
> Why is it a hack?  Is it also a hack to not use a separate ioctl to
> reset the device, to move its address map, etc?
>
It is a hack because purpose of the interface becomes unclear.

You can have a separate interface without having a separate ioctl.  I
understand the objection to calling certain types of commands
"attributes", but I don't understand why "set this attribute" can't be
called a command.

If you
see it called in a code you have no idea what semantics is expected.

Hmm?  If you see a call to KVM_IOCTL_FOO, you look up the documentation
for KVM_IOCTL_FOO to see what the semantics are.  If you see a call to
KVM_DEVICE_COMMAND with a KVM_DEV_MPIC_GRP_FOO command group, you look  
at
mpic.txt to see what the semantics are.

For instance getting/setting of a state should be done when vcpus are
not running,

I don't have anything on MPIC that I would currently want to define as
"state" under that definition.  If/when migration is implemented, *some*
of the state might have that restriction, and those would be accessed  
via
something other than KVM_DEV_MPIC_GRP_REGISTER.  I would prefer not to
create accessors for all state in this way, just where something special
is needed -- as in PPC booke ONE_REG where only things like TSR need
special handling.  Plus, most of what would need special handling on  
MPIC
is totally hidden state (e.g. the pending/active queues and the
interrupt input state) rather than a special variant of an existing
register.  The registers that show active MSIs (normally read-only and
read-to-clear) are one of the few exceptions.

Note that earlier you said that you'd be OK with calling the base  
address
an attribute, even if it can change during guest execution -- so is
"should be done when vcpus are not running" really a good thing to base
any such split on?

but commands may be sent while they are. This also
encourage bugs when incorrect attribute is used during migration or  
vice
versa.

How is that different from specifing the flag incorrectly when you set  
an MSR?

In short interface does not help you, it requires you to read
documentation of each device very carefully.

You need to read it carefully enough to know what it does and how to use
it, regardless of any such splitting of ioctls.  This is especially true
if/when we start adding hidden state for migration, since the format of
that state is not defined by the device itself.

Reset is one thing that will be implemented as a command ioctl.   
Moving
address map depends.  If an address is a part of device's sate and  
will
be migrated as such then it is device attribute, otherwise use command
to move it.

It would be migrated, but can also be set at runtime for non-migration
reasons.

-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html