On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
<alexandre.chartre@xxxxxxxxxx> wrote:
>
>
> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
> >
> >
> >> On May 14, 2019, at 1:25 AM, Alexandre Chartre <alexandre.chartre@xxxxxxxxxx> wrote:
> >>
> >>
> >>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
> >>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
> >>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
> >>>> <alexandre.chartre@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> pcpu_base_addr is already mapped to the KVM address space, but this
> >>>>> represents the first percpu chunk. To access a per-cpu buffer not
> >>>>> allocated in the first chunk, add a function which maps all cpu
> >>>>> buffers corresponding to that per-cpu buffer.
> >>>>>
> >>>>> Also add a function to clear page table entries for a percpu buffer.
> >>>>>
> >>>>
> >>>> This needs some kind of clarification so that readers can tell whether
> >>>> you're trying to map all percpu memory or just map a specific
> >>>> variable. In either case, you're making a dubious assumption that
> >>>> percpu memory contains no secrets.
> >>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
> >>> does contain secrits, invalidating that premise.
> >>
> >> The current code unconditionally maps the entire first percpu chunk
> >> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
> >> mainly a simplification for the POC because a lot of core information
> >> that we need, for example just to switch mm, is stored there (like
> >> cpu_tlbstate, current_task...).
> >
> > I don’t think you should need any of this.
> >
>
> At the moment, the current code does need it. Otherwise it can't switch from
> kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
> and then the page fault handler will fail accessing "current" before calling
> the kvm page fault handler. So it will double fault or loop on page faults.
> There are many different places where percpu variables are used, and I have
> experienced many double fault/page fault loops because of that.

Now you're experiencing what working on the early PTI code was like :)

This is why I think you shouldn't touch current in any of this.

>
> >>
> >> If the entire first percpu chunk effectively contains secrets, then we will
> >> need to individually map only the buffers we need. The kvm_copy_percpu_mapping()
> >> function is added to copy the mapping for a specified percpu buffer, so
> >> this is used to map percpu buffers which are not in the first percpu chunk.
> >>
> >> Also note that mapping is constrained by PTE granularity (4K), so mapped
> >> buffers (percpu or not) which do not fill a whole set of pages can leak
> >> adjacent data stored on the same pages.
> >>
> >>
> >
> > I would take a different approach: figure out what you need and put it in its
> > own dedicated area, kind of like cpu_entry_area.
>
> That's certainly something we can do, like Julian proposed with "Process-local
> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>
> That's fine for buffers allocated from KVM; however, we will still need some
> core kernel mappings so the thread can run and interrupts can be handled.
>
> > One nasty issue you’ll have is vmalloc: the kernel stack is in the
> > vmap range, and, if you allow access to vmap memory at all, you’ll
> > need some way to ensure that *unmap* gets propagated. I suspect the
> > right choice is to see if you can avoid using the kernel stack at all
> > in isolated mode. Maybe you could run on the IRQ stack instead.
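Side note, going back to the percpu part above: for concreteness, this is
the shape I would expect kvm_copy_percpu_mapping() to take. This is only my
reading of the description, not the actual patch; the signature and the loop
are guesses, and kvm_copy_ptes() is the helper the series already uses:

/*
 * Guessed sketch of kvm_copy_percpu_mapping(): for a percpu buffer that
 * is not in the first chunk, copy the PTEs of every possible CPU's
 * instance of that buffer into the KVM page table.
 */
int kvm_copy_percpu_mapping(void *percpu_addr, size_t size)
{
	void *ptr;
	int cpu, err;

	for_each_possible_cpu(cpu) {
		/* address of this CPU's copy of the buffer */
		ptr = per_cpu_ptr(percpu_addr, cpu);
		err = kvm_copy_ptes(ptr, size);
		if (err)
			return err;
	}
	return 0;
}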
>
> I am currently just copying the task stack mapping into the KVM page table
> (patch 23) when a vcpu is created:
>
>         err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>
> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
> so I am making the assumption that the same task is used to create and free
> a vcpu.
>

vCPUs are bound to an mm but not a specific task, right? So I think
this is wrong in both directions.

Suppose a vCPU is created, then the task exits, the stack mapping gets
freed (the core code tries to avoid this, but it does happen), and a
new stack gets allocated at the same VA with different physical pages.
Now you're toast :) On the flip side, wouldn't you crash if a vCPU is
created and then run on a different thread?

How important is the ability to enable IRQs while running with the KVM
page tables?
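For concreteness, the pattern I'm questioning, as I understand it from the
quoted description: kvm_copy_ptes() is quoted from patch 23, but
kvm_clear_mapping() is just a placeholder name for whatever the series uses
to undo the copy, and the hook points are my guess:

	/* at vCPU creation time, tsk being the creating task */
	err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);

	/*
	 * at vCPU free time -- only correct if the same task's stack is
	 * still live and still at the same VA, which is exactly the
	 * assumption that breaks in the scenarios above
	 */
	kvm_clear_mapping(tsk->stack, THREAD_SIZE);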