On Thu, Jan 25, 2018 at 02:00:22PM -0800, Andy Lutomirski wrote: > On Thu, Jan 25, 2018 at 1:49 PM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote: > > On 01/25/2018 01:12 PM, Andy Lutomirski wrote: > >> Neil Berrington reported a double-fault on a VM with 768GB of RAM that > >> uses large amounts of vmalloc space with PTI enabled. > >> > >> The cause is that load_new_mm_cr3() was never fixed to take the > >> 5-level pgd folding code into account, so, on a 4-level kernel, the > >> pgd synchronization logic compiles away to exactly nothing. > > > > You don't mention it, but we can normally handle vmalloc() faults in the > > kernel that are due to unsynchronized page tables. The thing that kills > > us here is that we have an unmapped stack and we try to use that stack > > when entering the page fault handler, which double faults. The double > > fault handler gets a new stack and saves us enough to get an oops out. > > > > Right? > > Exactly. > > There are two special code paths that can't use vmalloc_fault(): this > one and switch_to(). The latter avoids explicit page table fiddling > and just touches the new stack before loading it into rsp. > > > > >> +static void sync_current_stack_to_mm(struct mm_struct *mm) > >> +{ > >> + unsigned long sp = current_stack_pointer; > >> + pgd_t *pgd = pgd_offset(mm, sp); > >> + > >> + if (CONFIG_PGTABLE_LEVELS > 4) { > >> + if (unlikely(pgd_none(*pgd))) { > >> + pgd_t *pgd_ref = pgd_offset_k(sp); > >> + > >> + set_pgd(pgd, *pgd_ref); > >> + } > >> + } else { > >> + /* > >> + * "pgd" is faked. The top level entries are "p4d"s, so sync > >> + * the p4d. This compiles to approximately the same code as > >> + * the 5-level case. > >> + */ > >> + p4d_t *p4d = p4d_offset(pgd, sp); > >> + > >> + if (unlikely(p4d_none(*p4d))) { > >> + pgd_t *pgd_ref = pgd_offset_k(sp); > >> + p4d_t *p4d_ref = p4d_offset(pgd_ref, sp); > >> + > >> + set_p4d(p4d, *p4d_ref); > >> + } > >> + } > >> +} > > > > We keep having to add these. It seems like a real deficiency in the > > mechanism that we're using for pgd folding. Can't we get a warning or > > something when we try to do a set_pgd() that's (silently) not doing > > anything? This exact same pattern bit me more than once with the > > KPTI/KAISER patches. > > Hmm, maybe. > > What I'd really like to see is an entirely different API. Maybe: > > typedef struct { > opaque, but probably includes: > int depth; /* 0 is root */ > void *table; > } ptbl_ptr; > > ptbl_ptr root_table = mm_root_ptbl(mm); > > set_ptbl_entry(root_table, pa, prot); > > /* walk tables */ > ptbl_ptr pt = ...; > ptentry_ptr entry; > while (ptbl_has_children(pt)) { > pt = pt_next(pt, addr); > } > entry = pt_entry_at(pt, addr); > /* do something with entry */ > > etc. I thought about very similar design, but never got time to try it really. It's not one-week-end type of project :/ -- Kirill A. Shutemov