On 3/23/22 20:35, Junaid Shahid wrote:
> On 3/22/22 02:46, Alexandre Chartre wrote:
>> On 3/18/22 00:25, Junaid Shahid wrote:
>>> I agree that it is not secure to run one sibling in the
>>> unrestricted kernel address space while the other sibling is
>>> running in an ASI restricted address space, without doing a cache
>>> flush before re-entering the VM. However, I think that avoiding
>>> this situation does not require doing a sibling stun operation
>>> immediately after VM Exit. The way we avoid it is as follows.
>>>
>>> First, we always use ASI in conjunction with core scheduling.
>>> This means that if HT0 is running a VCPU thread, then HT1 will be
>>> running either a VCPU thread of the same VM or the Idle thread.
>>> If it is running a VCPU thread, then if/when that thread takes a
>>> VM Exit, it will also be running in the same ASI restricted
>>> address space. For the idle thread, we have created another ASI
>>> Class, called Idle-ASI, which maps only globally non-sensitive
>>> kernel memory. The idle loop enters this ASI address space.
>>>
>>> This means that when HT0 does a VM Exit, HT1 will either be
>>> running the guest code of a VCPU of the same VM, or it will be
>>> running kernel code in either a KVM-ASI or the Idle-ASI address
>>> space. (If HT1 is already running in the full kernel address
>>> space, that would imply that it had previously done an ASI Exit,
>>> which would have triggered a stun_sibling, which would have
>>> already caused HT0 to exit the VM and wait in the kernel).
>> Note that using core scheduling (or not) is a detail; what is
>> important is whether HTs are running with ASI or not. Running core
>> scheduling will just improve the chances that all siblings run ASI
>> at the same time, and so improve ASI performance.
>>> If HT1 now does an ASI Exit, that will trigger the
>>> stun_sibling() operation in its pre_asi_exit() handler, which
>>> will set the state of the core/HT0 to Stunned (and possibly send
>>> an IPI too, though that will be ignored if HT0 was already in
>>> kernel mode). Now when HT0 tries to re-enter the VM, since its
>>> state is set to Stunned, it will just wait in a loop until HT1
>>> does an unstun_sibling() operation, which it will do in its
>>> post_asi_enter handler the next time it does an ASI Enter (which
>>> would be either just before VM Enter if it was KVM-ASI, or in the
>>> next iteration of the idle loop if it was Idle-ASI). In either
>>> case, HT1's post_asi_enter() handler would also do a
>>> flush_sensitive_cpu_state operation before the unstun_sibling(),
>>> so when HT0 gets out of its wait-loop and does a VM Enter, there
>>> will not be any sensitive state left.
>>>
>>> One thing that probably was not clear from the patch is that
>>> the stun state check and wait-loop is still always executed
>>> before VM Enter, even if no ASI Exit happened in that execution.
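
To make the sequence concrete, here is a minimal C sketch of the
protocol described above. stun_sibling(), unstun_sibling(),
pre_asi_exit(), post_asi_enter() and flush_sensitive_cpu_state() are
the names used in the description; the per-HT state array and the
sibling() helper are hypothetical, so the actual patch code certainly
differs:

    #include <stdatomic.h>

    enum stun_state { STUNNED, UNSTUNNED };

    /* One entry per HT of a core; both start out Stunned. */
    static _Atomic enum stun_state ht_state[2] = { STUNNED, STUNNED };

    static int sibling(int ht) { return ht ^ 1; }

    /* pre_asi_exit() handler: runs before this HT leaves ASI. */
    static void stun_sibling(int ht)
    {
            atomic_store(&ht_state[sibling(ht)], STUNNED);
            /* + IPI, ignored if the sibling is already in the kernel */
    }

    /* post_asi_enter() handler: runs after this HT enters ASI. */
    static void unstun_sibling(int ht)
    {
            /* flush_sensitive_cpu_state() happens just before this */
            atomic_store(&ht_state[sibling(ht)], UNSTUNNED);
    }

    /* Executed on every VM Enter, even if no ASI Exit happened. */
    static void wait_while_stunned(int ht)
    {
            while (atomic_load(&ht_state[ht]) == STUNNED)
                    ; /* spin until the sibling unstuns us */
    }
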
>> So if I understand correctly, you have the following sequence:
>>
>> 0 - Initially the state is set to "stunned" for all cpus (i.e. a cpu
>>     should wait before VMEnter)
>> 1 - After ASI Enter: set sibling state to "unstunned" (i.e. sibling
>>     can do VMEnter)
>> 2 - Before VMEnter: wait while my state is "stunned"
>> 3 - Before ASI Exit: set sibling state to "stunned" (i.e. sibling
>>     should wait before VMEnter)
>> I have tried this kind of implementation, and the problem is with
>> step 2 (wait while my state is "stunned"): how do you wait exactly?
>> You can't just do an active wait, otherwise you have all kinds of
>> problems (depending on whether you have interrupts enabled or not),
>> especially as you don't know how long you have to wait (that depends
>> on what the other cpu is doing).
> In our stunning implementation, we do an active wait with interrupts
> enabled and with a need_resched() check to decide when to bail out
> to the scheduler (plus we also make sure that we re-enter ASI at the
> end of the wait in case some interrupt exited ASI). What kind of
> problems have you run into with an active wait, besides wasted CPU
> cycles?
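
As a sketch (reusing the hypothetical ht_state array from above;
need_resched() and cpu_relax() are the standard kernel helpers, and
asi_enter() stands in for whatever re-enters the ASI address space),
the wait you describe would look roughly like this:

    /* Active wait with interrupts enabled; returns false when we
     * should bail out to the scheduler instead of entering the VM. */
    static bool wait_for_unstun(int ht)
    {
            while (atomic_load(&ht_state[ht]) == STUNNED) {
                    if (need_resched())
                            return false;
                    cpu_relax();
            }
            /* An interrupt during the wait may have exited ASI,
             * so re-enter ASI before proceeding to VM Enter. */
            asi_enter(/* KVM-ASI */);
            return true;
    }
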
If you wait with interrupts enabled, then there is a window after the
wait and before interrupts get disabled where a cpu can get an
interrupt and exit ASI while the sibling is entering the VM. Also,
after a cpu has passed the wait and has disabled interrupts, it can't
be notified if the sibling has exited ASI:
T+01 - cpu A and B enter ASI - interrupts are enabled
T+02 - cpu A and B pass the wait because both are using ASI - interrupts are enabled
T+03 - cpu A gets an interrupt
T+04 - cpu B disables interrupts
T+05 - cpu A exits ASI and processes the interrupt
T+06 - cpu B enters VM => cpu B runs VM while cpu A is not using ASI
T+07 - cpu B exits VM
T+08 - cpu B exits ASI
T+09 - cpu A returns from interrupt
T+10 - cpu A disables interrupts and enters VM => cpu A runs VM while cpu B is not using ASI
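
In terms of the sketch above, the window is between the end of the
wait loop and the point where interrupts get disabled; nothing
re-checks the stun state in between (again with hypothetical helpers):

    static void vm_enter_path(int ht)
    {
            wait_for_unstun(ht);    /* T+02: wait passes, interrupts on */

            /* <-- window: an ASI Exit on the sibling from here on is
             *     never noticed, the stun state is not checked again */

            local_irq_disable();    /* T+04: can no longer be notified */
            /* vm_enter(); */       /* T+06/T+10: sibling may have left ASI */
    }
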
> In any case, the specific stunning mechanism is orthogonal to ASI.
> This implementation of ASI can be integrated with different stunning
> implementations. The "kernel core scheduling" that you proposed is
> also an alternative to stunning and could be similarly integrated
> with ASI.
Yes, but for ASI to be relevant for preventing data leaks with KVM, you
need a fully functional and reliable stunning mechanism; otherwise ASI
is useless. That's why I think it is better to first focus on having an
effective stunning mechanism, and then implement ASI.
alex.