Re: [RFC KVM 18/27] kvm/isolation: function to copy page table entries for percpu buffer

On 5/14/19 5:23 PM, Andy Lutomirski wrote:
> On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
> <alexandre.chartre@xxxxxxxxxx> wrote:
>>
>> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
>>>
>>>> On May 14, 2019, at 1:25 AM, Alexandre Chartre <alexandre.chartre@xxxxxxxxxx> wrote:
>>>>
>>>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
>>>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>>>>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>>>>>> <alexandre.chartre@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>>>>>> represents the first percpu chunk. To access a per-cpu buffer not
>>>>>>> allocated in the first chunk, add a function which maps all cpu
>>>>>>> buffers corresponding to that per-cpu buffer.
>>>>>>>
>>>>>>> Also add a function to clear page table entries for a percpu buffer.
>>>>>>
>>>>>> This needs some kind of clarification so that readers can tell whether
>>>>>> you're trying to map all percpu memory or just map a specific
>>>>>> variable.  In either case, you're making a dubious assumption that
>>>>>> percpu memory contains no secrets.
>>>>>
>>>>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
>>>>> does contain secrits, invalidating that premise.
>>>>
>>>> The current code unconditionally maps the entire first percpu chunk
>>>> (pcpu_base_addr), so it assumes the chunk doesn't contain any secret. That
>>>> is mainly a simplification for the POC, because a lot of the core
>>>> information that we need, for example just to switch mm, is stored there
>>>> (like cpu_tlbstate, current_task...).
>>>
>>> I don’t think you should need any of this.
>>
>> At the moment, the current code does need it. Otherwise it can't switch from
>> the kvm mm to the kernel mm: switch_mm_irqs_off() will fault accessing
>> "cpu_tlbstate", and then the page fault handler will fail accessing "current"
>> before calling the kvm page fault handler. So it will double fault or loop
>> on page faults. Percpu variables are used in many different places, and I
>> have run into many double faults and page fault loops because of that.
>
> Now you're experiencing what working on the early PTI code was like :)
>
> This is why I think you shouldn't touch current in any of this.
>>
>> If the entire first percpu chunk effectively contains secrets, then we will
>> need to individually map only the buffers we need. The kvm_copy_percpu_mapping()
>> function is added to copy the mapping for a specified percpu buffer, so it
>> is used to map percpu buffers which are not in the first percpu chunk.
>>
>> Also note that the mapping is constrained by the PTE size (4K), so mapped
>> buffers (percpu or not) which do not fill a whole set of pages can leak
>> adjacent data stored on the same pages.
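
[ To illustrate the point above: the helper essentially walks every possible
  CPU and copies the page table entries for that CPU's copy of the buffer.
  This is only a sketch of the idea, not the actual patch code --
  kvm_copy_ptes() is the helper from patch 23; the exact signature and error
  handling may differ: ]

        int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
        {
                int cpu, err;

                /*
                 * A percpu buffer has one copy per possible CPU; copy
                 * the PTEs mapping each copy into the KVM page table.
                 */
                for_each_possible_cpu(cpu) {
                        err = kvm_copy_ptes(per_cpu_ptr(percpu_ptr, cpu), size);
                        if (err)
                                return err;
                }

                return 0;
        }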



>>>
>>> I would take a different approach: figure out what you need and put it in
>>> its own dedicated area, kind of like cpu_entry_area.

>> That's certainly something we can do, like Julian proposed with "Process-local
>> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>>
>> That's fine for buffers allocated from KVM; however, we will still need some
>> core kernel mappings so the thread can run and interrupts can be handled.

>>> One nasty issue you’ll have is vmalloc: the kernel stack is in the
>>> vmap range, and, if you allow access to vmap memory at all, you’ll
>>> need some way to ensure that *unmap* gets propagated. I suspect the
>>> right choice is to see if you can avoid using the kernel stack at all
>>> in isolated mode.  Maybe you could run on the IRQ stack instead.

>> I am currently just copying the task stack mapping into the KVM page table
>> (patch 23) when a vcpu is created:
>>
>>          err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>>
>> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
>> so I am making the assumption that the same task is used to create and free
>> a vcpu.


> vCPUs are bound to an mm but not a specific task, right?  So I think
> this is wrong in both directions.


I know, that was yet another shortcut for the POC: I assume there's a 1:1
mapping between a vCPU and a task, but I think that's fair with qemu.


> Suppose a vCPU is created, then the task exits, the stack mapping gets
> freed (the core code tries to avoid this, but it does happen), and a
> new stack gets allocated at the same VA with different physical pages.
> Now you're toast :)  On the flip side, wouldn't you crash if a vCPU is
> created and then run on a different thread?

Yes, that's why I have a safety net: before entering KVM isolation I always
check that the current task is mapped in the KVM address space; if not, it
gets mapped.
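
[ A minimal sketch of what that safety net can look like; kvm_copy_ptes()
  is the helper from patch 23, while the function name and the isol_task
  field are made up here for illustration: ]

        static int kvm_isolation_enter_check(struct kvm_vcpu *vcpu)
        {
                int err;

                /*
                 * The vcpu may be run by a different task than the one
                 * which created it, so make sure the current task stack
                 * is mapped into the KVM page table before switching to
                 * the KVM address space.
                 */
                if (vcpu->arch.isol_task == current)
                        return 0;

                err = kvm_copy_ptes(current->stack, THREAD_SIZE);
                if (err)
                        return err;

                vcpu->arch.isol_task = current;
                return 0;
        }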

> How important is the ability to enable IRQs while running with the KVM
> page tables?


I can't say; I would need to check, but we probably need IRQs at least for
some timers. It sounds like you would really prefer IRQs to be disabled.


alex.



