Re: [RFC KVM 18/27] kvm/isolation: function to copy page table entries for percpu buffer


On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
<alexandre.chartre@xxxxxxxxxx> wrote:
>
>
> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
> >
> >
> >> On May 14, 2019, at 1:25 AM, Alexandre Chartre <alexandre.chartre@xxxxxxxxxx> wrote:
> >>
> >>
> >>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
> >>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
> >>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> >>>> <alexandre.chartre@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> pcpu_base_addr is already mapped to the KVM address space, but it
> >>>>> only covers the first percpu chunk. To access a per-cpu buffer that
> >>>>> is not allocated in the first chunk, add a function which maps, for
> >>>>> each CPU, the instance corresponding to that per-cpu buffer.
> >>>>>
> >>>>> Also add a function to clear page table entries for a percpu buffer.
> >>>>>
> >>>>
> >>>> This needs some kind of clarification so that readers can tell whether
> >>>> you're trying to map all percpu memory or just map a specific
> >>>> variable.  In either case, you're making a dubious assumption that
> >>>> percpu memory contains no secrets.
> >>> I'm thinking the per-cpu random pool is a secret. IOW, it demonstrably
> >>> does contain secrets, invalidating that premise.
> >>
> >> The current code unconditionally maps the entire first percpu chunk
> >> (pcpu_base_addr), so it assumes the chunk doesn't contain any secrets.
> >> That is mainly a simplification for the POC, because a lot of core
> >> information that we need, for example just to switch mm, is stored
> >> there (like cpu_tlbstate, current_task...).
> >
> > I don’t think you should need any of this.
> >
>
> At the moment, the current code does need it. Otherwise it can't switch from
> the kvm mm to the kernel mm: switch_mm_irqs_off() will fault accessing
> "cpu_tlbstate", and then the page fault handler will fail accessing "current"
> before calling the kvm page fault handler. So it will double fault or loop
> on page faults. Percpu variables are used in many different places, and I
> have run into many double faults and page-fault loops because of that.

Now you're experiencing what working on the early PTI code was like :)

This is why I think you shouldn't touch current in any of this.
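
For concreteness, here is a minimal sketch of what mapping a single percpu
variable across all CPUs could look like.  kvm_copy_ptes() is the helper
from this series; the function body below is my guess at the shape of
kvm_copy_percpu_mapping(), not the actual patch:

	/*
	 * Hypothetical sketch: mirror one percpu buffer's per-CPU
	 * instances into the KVM page table.  Each CPU's instance
	 * lives at a different address, so each needs its own copy.
	 */
	int kvm_copy_percpu_mapping(void *percpu_ptr, size_t size)
	{
		int cpu, err;

		for_each_possible_cpu(cpu) {
			void *ptr = per_cpu_ptr(percpu_ptr, cpu);

			err = kvm_copy_ptes(ptr, size);
			if (err)
				return err;
		}

		return 0;
	}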

>
> >>
> >> If the entire first percpu chunk effectively contains secrets, then we
> >> will need to individually map only the buffers we need. The
> >> kvm_copy_percpu_mapping() function is added to copy the mapping for a
> >> specified percpu buffer, so it is used to map percpu buffers which are
> >> not in the first percpu chunk.
> >>
> >> Also note that mapping is constrained by PTE granularity (4K), so mapped
> >> buffers (percpu or not) which do not fill a whole set of pages can leak
> >> adjacent data stored on the same pages.
> >>
> >>
> >
> > I would take a different approach: figure out what you need and put it in its
> > own dedicated area, kind of like cpu_entry_area.
>
> That's certainly something we can do, like Julian proposed with "Process-local
> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>
> That's fine for buffers allocated from KVM, however, we will still need some
> core kernel mappings so the thread can run and interrupts can be handled.
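
Something like the sketch below is what I have in mind: a small, dedicated,
page-aligned area in the spirit of cpu_entry_area, explicitly mirrored into
the KVM page table.  kvm_copy_ptes() is from this series; the allocator
name and shape are hypothetical.  Whole-page allocations also avoid the
4K-granularity leak you mention above, since no unrelated data shares the
mapped pages:

	/*
	 * Hypothetical sketch: allocate a page-aligned buffer and
	 * mirror its kernel mapping into the KVM page table, so the
	 * isolated code can use it without mapping the whole percpu
	 * area.  kvm_alloc_isolated() is an illustrative name.
	 */
	static void *kvm_alloc_isolated(size_t size)
	{
		unsigned int order = get_order(size);
		void *buf;

		/* whole pages: nothing else can share these PTEs */
		buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
					       order);
		if (!buf)
			return NULL;

		if (kvm_copy_ptes(buf, size)) {
			free_pages((unsigned long)buf, order);
			return NULL;
		}

		return buf;
	}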
>
> > One nasty issue you’ll have is vmalloc: the kernel stack is in the
> > vmap range, and, if you allow access to vmap memory at all, you’ll
> > need some way to ensure that *unmap* gets propagated. I suspect the
> > right choice is to see if you can avoid using the kernel stack at all
> > in isolated mode.  Maybe you could run on the IRQ stack instead.
>
> I am currently just copying the task stack mapping into the KVM page table
> (patch 23) when a vcpu is created:
>
>         err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>
> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
> so I am making the assumption that the same task is used to create and free
> a vcpu.
>

vCPUs are bound to an mm but not a specific task, right?  So I think
this is wrong in both directions.

Suppose a vCPU is created, then the task exits, the stack mapping gets
freed (the core code tries to avoid this, but it does happen), and a
new stack gets allocated at the same VA with different physical pages.
Now you're toast :)  On the flip side, wouldn't you crash if a vCPU is
created and then run on a different thread?
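
If you keep the stack-mapping approach at all, the copy would at least have
to follow the task that actually runs the vCPU, e.g. something like the
hypothetical sketch below, done on each entry to isolated mode rather than
at vCPU creation (kvm_copy_ptes() is from patch 23; the hook name and
placement are my assumption).  That still doesn't solve unmap propagation,
though:

	/*
	 * Hypothetical sketch: refresh the stack mapping on every
	 * entry to isolated mode, so it tracks whichever task is
	 * currently running the vCPU rather than its creator.
	 */
	static int kvm_isolation_map_stack(void)
	{
		/* map the *current* task's stack, not the creator's */
		return kvm_copy_ptes(current->stack, THREAD_SIZE);
	}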

How important is the ability to enable IRQs while running with the KVM
page tables?
