Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

Anthony Liguori <anthony@xxxxxxxxxxxxx> · Mon, 07 Jun 2010 12:04:53 -0500

On 06/07/2010 10:26 AM, Avi Kivity wrote:
I am currently investigating a problem with a guest running Linux 
malfunctioning in the NMI watchdog code.  The problem is that we don't 
handle NMI delivery mode for the local APIC LINT0 pin; instead we 
expect ExtInt delivery mode or that the line is disabled completely.  
In addition the i8254 timer is tied to the BSP, while in this case the 
timer can broadcast to all vcpus.

There is some code that tries to second-guess the guest and provide it 
the inputs it sees, but this is fragile.  The only way to get reliable 
operation is to emulate the hardware fully.

Now I'd much rather do that in userspace, since it's a lot of 
sensitive work.  I'll enumerate below the general motivation, 
advantages and disadvantages, and a plan for moving forward.

Motivation
==========

The original motivation for moving the PIC and IOAPIC into the kernel 
was performance, especially for assigned devices.  Both devices are 
high interaction since they deal with interrupts; practically after 
every interrupt there is either a PIC ioport write, or an APIC bus 
message, both signalling an EOI operation.  Moving the PIT into the 
kernel allowed us to catch up with missed timer interrupt injections, 
and sped up guests which read the PIT counters (e.g. tickless guests).

However, modern guests running on modern qemu use MSI extensively; 
both virtio and assigned devices now have MSI support; and the planned 
VFIO only supports kernel delivery via MSI anyway; line based 
interrupts will need to be mediated by userspace.

The only high frequency non-MSI interrupt sources remaining are the 
various timers; and the default one, HPET, is in userspace (and having 
its own scaling problems as a result).  So in theory we can move PIC, 
IOAPIC, and PIT support to userspace and not lose much performance.

I think we could also move the local APIC.

To optimize device models, we've tended to put the full device model in 
the kernel whereas the hardware vendors have tended to put only the fast 
paths of the device models in hardware.

For instance, we could introduce a userspace interface similar to vapic 
support, in which a shared page that maps the APIC's layout is used 
together with a mask that selects which registers trap on read/write.
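
A minimal sketch of what such a trap mask could look like, in C.  The 
struct and function names are hypothetical and do not exist in KVM; the 
only facts assumed are that local APIC registers sit on 16-byte 
boundaries within a 4 KiB page (256 slots) and the architectural offsets 
of TPR (0x080), EOI (0x0b0) and the low ICR word (0x300).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trap mask for a shared local APIC page: one bit per
 * 16-byte register slot, separately for reads and writes. */
#define APIC_SLOTS 256

struct apic_trap_mask {
    uint64_t read_trap[APIC_SLOTS / 64];   /* bit set: reads go to userspace */
    uint64_t write_trap[APIC_SLOTS / 64];  /* bit set: writes go to userspace */
};

/* Mark the register at the given page offset as trapping. */
static void apic_trap_set(uint64_t *bits, unsigned offset)
{
    unsigned slot = offset >> 4;
    assert(slot < APIC_SLOTS);
    bits[slot / 64] |= UINT64_C(1) << (slot % 64);
}

/* Returns true when the access must be forwarded to the userspace model,
 * false when it can be satisfied directly from the shared page. */
static bool apic_access_traps(const struct apic_trap_mask *m,
                              unsigned offset, bool is_write)
{
    unsigned slot = offset >> 4;
    assert(slot < APIC_SLOTS);
    const uint64_t *bits = is_write ? m->write_trap : m->read_trap;
    return (bits[slot / 64] >> (slot % 64)) & 1;
}
```

With a mask like this, userspace might trap only writes to EOI and ICR 
while leaving TPR reads and writes untrapped; which registers are safe 
to leave untrapped is a policy question the sketch does not answer.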

That said, I can understand an argument that the local APIC is part of 
the CPU state since it's a very special type of device.

A better example would be a generic counter kernel mechanism.  I can 
envision such a device as doing nothing more than providing a read-only 
view of a counter with a userspace configurable divider and width.  Any 
write to the counter or read of any other byte outside the counter 
register would result in a trap to userspace.

That should allow both the PIT and the HPET to be accelerated with 
minimal effort in the kernel.
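
As a rough illustration of the counter idea, here is a sketch in C.  
Everything in it is an assumption made for the example: the names, the 
choice of an up-counter, and the rule that only a read covering exactly 
the counter register is handled without a trap.  Real PIT counters count 
down and have latch semantics that this does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generic counter: a read-only view of a host clock,
 * scaled by a divider and truncated to a configurable width. */
struct gen_counter {
    uint64_t base_ticks;   /* host clock value at which the counter read 0 */
    uint64_t divider;      /* host ticks per counter tick, set by userspace */
    unsigned width_bits;   /* counter width, 1..64, set by userspace */
    uint64_t reg_offset;   /* offset of the counter register in the device */
    unsigned reg_len;      /* size of the counter register in bytes */
};

enum access_result { ACCESS_HANDLED, ACCESS_TRAP_TO_USERSPACE };

static uint64_t gen_counter_value(const struct gen_counter *c,
                                  uint64_t now_ticks)
{
    assert(c->divider != 0 && c->width_bits >= 1 && c->width_bits <= 64);
    uint64_t ticks = (now_ticks - c->base_ticks) / c->divider;
    uint64_t mask = c->width_bits == 64
                        ? UINT64_MAX
                        : (UINT64_C(1) << c->width_bits) - 1;
    return ticks & mask;
}

/* Any write, and any read that is not exactly the counter register,
 * is reported as a trap to userspace. */
static enum access_result gen_counter_access(const struct gen_counter *c,
                                             uint64_t offset, unsigned len,
                                             bool is_write,
                                             uint64_t now_ticks,
                                             uint64_t *value)
{
    bool whole_reg = offset == c->reg_offset && len == c->reg_len;
    if (is_write || !whole_reg)
        return ACCESS_TRAP_TO_USERSPACE;
    *value = gen_counter_value(c, now_ticks);
    return ACCESS_HANDLED;
}
```

Userspace would reprogram base_ticks, divider and width_bits whenever 
the guest reconfigures the timer, which is why every write traps.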

Moving the implementation to userspace allows us more flexibility, and 
more consistency in the implementation of timekeeping for the various 
clock chips; it becomes easier to follow the nuances of real hardware 
in this area.

Interestingly, while the IOAPIC/PIC code was written we proposed 
making it independent of the local APIC; had we done so, the move 
would have been much easier (simply dropping the existing code).

Advantages of a move
====================

1. Reduced kernel footprint

Good for security, and allows fixing bugs without reboots.

2. Centralized timekeeping

Instead of having one solution for PIT timekeeping, and another for 
RTC and HPET timekeeping, we can have all timer chips in userspace.  
The local APIC timer still needs to be in the kernel - it is much too 
high bandwidth to be in userspace; but on the other hand it is very 
different from the other timer chips.

3. Flexibility

Easier to have weird board layouts (multiple IOAPICs, etc.).  Not a 
very strong advantage.

Disadvantages
=============

1. Still need to keep the old code around for a long while

We can't just rip it out - old userspace depends on it.  So the 
security advantages are only with cooperating userspace, and the other 
advantages only show up.

2. Need to bring the qemu code up to date

The current qemu ioapic code lags some way behind the kernel; we also 
need PIT timekeeping.

3. May need kernel support for interval-timer-follows-thread

Currently the timekeeping code has an optimization which causes the 
hrtimer that models the PIT to follow the BSP (which is most likely to 
receive the interrupt); this reduces cpu cross-talk.

I don't think the kernel interval timer code has such an optimization; 
we may need to implement it.

4. Much churn

This is a lot of work.

I'd be in favor of a straight port to userspace.  We already have the 
interfaces to communicate with an external device model for these 
devices so let's just take the kernel code and stick it into dedicated 
threads in userspace.

I think it's easier to then work to merge the two bits of code in the 
same tree than it is to try and take out-of-tree code and merge it 
incrementally.

5. Risk

We may find out after all this is implemented that performance is not 
acceptable and all the work will have to be dropped.

That's another advantage to a straight port to userspace.  We can 
collect performance data with only a modest amount of engineering effort.

Regards,

Anthony Liguori

Proposed interface
==================

1. KVM_SET_LINT_PIN (vcpu ioctl)

Sets the value (0 or 1) that a vcpu's LINT0 or LINT1 senses.

Note: problematic; may be high frequency but ignored due to masking at 
the local APIC LVT level.  Will also be broadcast across all vcpus by 
userspace with typical configurations.  We may need a way to tell 
userspace we'll be ignoring those signals.

May also be extended to emulate thermal interrupts if someone feels 
the need.

An alternative is a couple of new fields in kvm_run which are sampled 
on every entry (unless masked).
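
To make the masking concern concrete, here is a sketch in C of how the 
receiving side might decide what an asserted LINT pin means.  The 
function and enum names are hypothetical; the LVT bit layout (vector in 
bits 0-7, delivery mode in bits 8-10, mask in bit 16) follows the Intel 
SDM.  Edge versus level handling is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LVT_DELIVERY(lvt)  (((lvt) >> 8) & 0x7u)
#define LVT_MASKED         (UINT32_C(1) << 16)

/* Architectural delivery mode encodings used by this sketch. */
enum { DM_FIXED = 0, DM_NMI = 4, DM_EXTINT = 7 };

enum lint_action {
    LINT_IGNORED,  /* pin deasserted, or LVT entry masked */
    LINT_FIXED,    /* deliver the vector held in the LVT entry */
    LINT_NMI,      /* inject an NMI */
    LINT_EXTINT,   /* run an INTACK cycle to fetch the vector */
    LINT_OTHER     /* SMI, INIT or reserved encodings */
};

static enum lint_action lint_pin_action(uint32_t lvt, bool asserted)
{
    if (!asserted || (lvt & LVT_MASKED))
        return LINT_IGNORED;
    switch (LVT_DELIVERY(lvt)) {
    case DM_FIXED:  return LINT_FIXED;
    case DM_NMI:    return LINT_NMI;
    case DM_EXTINT: return LINT_EXTINT;
    default:        return LINT_OTHER;
    }
}

/* Vector to deliver for a fixed-mode LVT entry. */
static uint8_t lvt_fixed_vector(uint32_t lvt)
{
    assert(LVT_DELIVERY(lvt) == DM_FIXED);
    return (uint8_t)(lvt & 0xffu);
}
```

Every call that ends in LINT_IGNORED corresponds to an ioctl that 
userspace did not need to issue, which is the cost the note above 
describes.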

2. KVM_EXIT_REASON_INTACK (kvm_run exit reason)

Informs userspace that the vcpu is running an INTACK cycle; userspace 
should provide the interrupt vector on the next KVM_RUN.
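
For illustration, this is roughly what a userspace PIC model would 
compute when asked for the vector.  It is a heavily simplified sketch 
in C: a single 8259 with fixed priority, edge-triggered inputs, no 
cascade and no special modes.  The struct and function names are made 
up for the example.

```c
#include <assert.h>
#include <stdint.h>

struct pic_state {
    uint8_t irr;   /* interrupt request register */
    uint8_t imr;   /* interrupt mask register */
    uint8_t isr;   /* in-service register */
    uint8_t base;  /* vector base from ICW2, low three bits zero */
};

/* Respond to an INTACK cycle: pick the highest-priority unmasked request
 * (IRQ0 is highest), move it from IRR to ISR and return its vector.
 * With nothing deliverable, answer with the spurious IRQ7 vector. */
static uint8_t pic_intack(struct pic_state *p)
{
    assert((p->base & 7) == 0);
    uint8_t pending = p->irr & (uint8_t)~p->imr;

    for (unsigned irq = 0; irq < 8; irq++) {
        uint8_t bit = (uint8_t)(1u << irq);
        if (p->isr & bit)
            break;  /* an in-service interrupt blocks lower priorities */
        if (pending & bit) {
            p->irr &= (uint8_t)~bit;
            p->isr |= bit;
            return (uint8_t)(p->base | irq);
        }
    }
    return (uint8_t)(p->base | 7);
}
```

The returned value is what userspace would hand back to the kernel 
before resuming the vcpu.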

3. KVM_APIC_MESSAGE (vm ioctl)

Sends an APIC message on the APIC message bus, if the destination is 
in the kernel (typically IOAPIC interrupt messages).

4. KVM_EXIT_REASON_APIC_MESSAGE (kvm_run exit reason)

Sends an APIC message on the APIC message bus, if the destination is 
not in the kernel (typically IOAPIC EOI messages).

The above are all architectural, and correspond to wires on physical 
systems.  This increases the confidence that they are correct.

5. KVM_REQUEST_EOI (vcpu ioctl) / KVM_EXIT_EOI (kvm_run exit reason)

We will get EOI messages via KVM_EXIT_REASON_APIC_MESSAGE for 
level-triggered interrupts.  However, for timekeeping we will also 
need an EOI for edge-triggered interrupts (if we choose the ack 
notifier method for timekeeping).

6. KVM_EXIT_REASON_LVT_MASK (kvm_run exit reason)

A notification that the LVT LINT0 or LVT LINT1 mask bit has changed, 
and thus we don't need to issue useless KVM_SET_LINT_PIN ioctls; also 
useful for timekeeping (can disable PIT if configured with ExtInt mode 
or lapic disabled).
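
A sketch in C of how userspace might use such a notification to skip 
useless pin updates.  All names are hypothetical, as is the fixed 
MAX_VCPUS limit.  The cache starts out all-masked because LVT entries 
come out of reset with the mask bit set.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VCPUS 8
#define NR_LINT_PINS 2

/* Per-vcpu cache of the LVT LINT0/LINT1 mask bits, updated from the
 * hypothetical LVT mask exit. */
struct lint_cache {
    bool masked[MAX_VCPUS][NR_LINT_PINS];
};

static void lint_cache_reset(struct lint_cache *c)
{
    for (unsigned v = 0; v < MAX_VCPUS; v++)
        for (unsigned pin = 0; pin < NR_LINT_PINS; pin++)
            c->masked[v][pin] = true;
}

/* Called when a vcpu reports that one of its LINT mask bits changed. */
static void lint_cache_on_lvt_mask_exit(struct lint_cache *c, unsigned vcpu,
                                        unsigned pin, bool masked)
{
    assert(vcpu < MAX_VCPUS && pin < NR_LINT_PINS);
    c->masked[vcpu][pin] = masked;
}

/* Fill out[] with the vcpus that still need a pin update for this pin
 * and return how many there are. */
static unsigned lint_targets(const struct lint_cache *c, unsigned nr_vcpus,
                             unsigned pin, unsigned *out)
{
    assert(nr_vcpus <= MAX_VCPUS && pin < NR_LINT_PINS);
    unsigned n = 0;
    for (unsigned v = 0; v < nr_vcpus; v++)
        if (!c->masked[v][pin])
            out[n++] = v;
    return n;
}
```

When lint_targets() returns zero for the pin the PIT output is wired 
to, userspace could stop the PIT timer entirely until a mask bit is 
cleared again.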

7. KVM_EXIT_REASON_APIC_MESSAGE_ACK (kvm_run exit reason)

If we use the current timekeeping method of detecting coalesced 
interrupts, we'll need an acknowledge when an APIC message is accepted 
by a local APIC, with the result (interrupt queued or interrupt 
coalesced).  This will need to be selectable by vcpu and vector number.
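
The queued-versus-coalesced distinction can be sketched in C as a 
test-and-set on the local APIC's interrupt request register.  The names 
below are invented for the example, and the model ignores priorities, 
the ISR and level-triggered vectors.

```c
#include <assert.h>
#include <stdint.h>

enum apic_accept { APIC_IRQ_QUEUED, APIC_IRQ_COALESCED };

/* 256-bit interrupt request register, one bit per vector. */
struct lapic_irr {
    uint32_t irr[8];
};

/* Accept a fixed-mode message.  If the vector is already pending, the new
 * interrupt merges with it and the sender is told it was coalesced. */
static enum apic_accept lapic_accept_fixed(struct lapic_irr *a, uint8_t vector)
{
    uint32_t bit = UINT32_C(1) << (vector % 32);
    uint32_t *word = &a->irr[vector / 32];

    if (*word & bit)
        return APIC_IRQ_COALESCED;
    *word |= bit;
    return APIC_IRQ_QUEUED;
}

/* The vcpu takes the interrupt: the vector is no longer pending. */
static void lapic_dispatch(struct lapic_irr *a, uint8_t vector)
{
    uint32_t bit = UINT32_C(1) << (vector % 32);
    assert(a->irr[vector / 32] & bit);
    a->irr[vector / 32] &= ~bit;
}
```

A userspace timer model would count the coalesced results and 
reinject that many ticks later, which is why the acknowledgement has to 
carry the result and not just the fact of delivery.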

8. KVM_CREATE_IRQCHIP (vm ioctl)

A new flag that tells kvm not to create a PIC and IOAPIC.
