Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr

Guo Ren <guoren@xxxxxxxxxx> · Sat, 29 Jul 2023 14:40:11 +0800

Sorry for the late reply, Alexandre. I've been busy with other stuff.

On Mon, Jul 24, 2023 at 5:05 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
>
>
> On 22/07/2023 01:59, Guo Ren wrote:
> > On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
> >>
> >> On 21/07/2023 18:08, Guo Ren wrote:
> >>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
> >>>> On 21/07/2023 16:51, guoren@xxxxxxxxxx wrote:
> >>>>> From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >>>>>
> >>>>> RISC-V specification permits the caching of PTEs whose V (Valid)
> >>>>> bit is clear. Operating systems must be written to cope with this
> >>>>> possibility, but implementers are reminded that eagerly caching
> >>>>> invalid PTEs will reduce performance by causing additional page
> >>>>> faults.
> >>>>>
> >>>>> So we must keep vmalloc_fault for the spurious page faults of kernel
> >>>>> virtual address from an OoO machine.
> >>>>>
> >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx>
> >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxx>
> >>>>> ---
> >>>>>     arch/riscv/mm/fault.c | 3 +--
> >>>>>     1 file changed, 1 insertion(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> >>>>> index 85165fe438d8..f662c9eae7d4 100644
> >>>>> --- a/arch/riscv/mm/fault.c
> >>>>> +++ b/arch/riscv/mm/fault.c
> >>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
> >>>>>          * only copy the information from the master page table,
> >>>>>          * nothing more.
> >>>>>          */
> >>>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
> >>>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
> >>>>> +     if (unlikely(addr >= TASK_SIZE)) {
> >>>>>                 vmalloc_fault(regs, code, addr);
> >>>>>                 return;
> >>>>>         }
> >>>> Can you share what you are trying to fix here?
> >>> We met a spurious page fault panic on an OoO machine.
> >>>
> >>> 1. The processor's speculative execution brings V=0 entries for
> >>> kernel virtual addresses into the TLB.
> >>> 2. Linux kernel installs the kernel virtual address with the page, and V=1
> >>> 3. When kernel code accesses the kernel virtual address, it raises
> >>> a page fault because of the V=0 entry in the TLB.
> >>> 4. No vmalloc_fault, then panic.
> >>>
> >>>> I have a fix (that's currently running our CI) for commit 7d3332be011e
> >>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
> >>>> implements flush_cache_vmap() since we lost the vmalloc_fault.
> >>> Could you share that patch?
> >>
> >> Here we go:
> >>
> >>
> >> Author: Alexandre Ghiti <alexghiti@xxxxxxxxxxxx>
> >> Date:   Fri Jul 21 08:43:44 2023 +0000
> >>
> >>       riscv: Implement flush_cache_vmap()
> >>
> >>       The RISC-V kernel needs a sfence.vma after a page table
> >>       modification: we used to rely on the vmalloc fault handling to
> >>       emit an sfence.vma, but commit 7d3332be011e ("riscv: mm:
> >>       Pre-allocate PGD entries for vmalloc/modules area") got rid of
> >>       this path for 64-bit kernels, so now we need to explicitly emit
> >>       a sfence.vma in flush_cache_vmap().
> >>
> >>       Note that we don't need to implement flush_cache_vunmap() as the
> >>       generic code should emit a flush tlb after unmapping a vmalloc
> >>       region.
> >>
> >>       Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> >>       Signed-off-by: Alexandre Ghiti <alexghiti@xxxxxxxxxxxx>
> >>
> >> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> >> index 8091b8bf4883..b93ffddf8a61 100644
> >> --- a/arch/riscv/include/asm/cacheflush.h
> >> +++ b/arch/riscv/include/asm/cacheflush.h
> >> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> >>    #define flush_icache_user_page(vma, pg, addr, len) \
> >>           flush_icache_mm(vma->vm_mm, 0)
> >>
> >> +#ifdef CONFIG_64BIT
> >> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> >> +#endif
> > I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In
> > addition, it would call IPI, which is a performance killer.
>
>
> At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but
> that needs to be fixed, see my last patchset
> https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@xxxxxxxxxxxx/.
>
> But can you at least check that this fixes your issue? It would be
> interesting to see if the problem comes from vmalloc or something else.
It could solve my issue.

>
>
> > What's the problem of spurious fault replay? It only costs a
> > local_tlb_flush with vaddr.
>
>
> We had this exact discussion internally this week, and the fault replay
> seems like a solution. But that needs to be thought through carefully: the
> vmalloc fault was removed for a reason (see Bjorn commit log), tracing
> functions can use vmalloc() in the path of the vmalloc fault, causing an
> infinite trap loop. And here you are simply re-enabling this problem.
Thanks for mentioning it. I will address that in the next version of the
patch by marking vmalloc_fault() as notrace:

-static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr)
+static void notrace vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr)

> In
> addition, this patch makes vmalloc_fault() catch *all* kernel faults in
> the kernel address space, so any genuine kernel fault would loop forever
> in vmalloc_fault().
We check whether a kernel vaddr is valid by walking the page table, not by
its address range. I'm not sure what "any genuine kernel fault would loop
forever in vmalloc_fault()" refers to. Could you give an example?
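
To make the page-table check concrete, here is a minimal user-space sketch
in C. It is a toy model, not the code in arch/riscv/mm/fault.c. It assumes
a single-level table and a per-hart TLB that is allowed to cache V=0
entries; every toy_* name, TOY_PAGES, and the fault_result enum are
hypothetical. The handler decides from the page table contents, so a fault
on an address whose PTE really is invalid is reported as genuine and is
not retried.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGES 8      /* hypothetical size of the toy kernel mapping */
#define PTE_V     0x1u   /* Valid bit (bit 0 of a RISC-V PTE) */

enum fault_result { FAULT_SPURIOUS_RETRY, FAULT_GENUINE };

struct toy_mm {
    uint64_t pte[TOY_PAGES];        /* toy "master" page table, one level */
    bool     tlb_cached[TOY_PAGES]; /* is this entry held in the local TLB? */
    uint64_t tlb[TOY_PAGES];        /* cached copy, possibly stale with V=0 */
};

/* Model a (speculative) walk that caches the PTE as-is, even when V=0. */
static void toy_speculative_fill(struct toy_mm *mm, size_t vpn)
{
    assert(vpn < TOY_PAGES);
    mm->tlb[vpn] = mm->pte[vpn];
    mm->tlb_cached[vpn] = true;
}

/* Model an access: true if translation succeeds, false if it faults. */
static bool toy_access(struct toy_mm *mm, size_t vpn)
{
    assert(vpn < TOY_PAGES);
    if (!mm->tlb_cached[vpn])
        toy_speculative_fill(mm, vpn);   /* TLB miss: walk and cache */
    return (mm->tlb[vpn] & PTE_V) != 0;
}

/* Toy fault handler: validity is decided by the page table, not by range. */
static enum fault_result toy_kernel_fault(struct toy_mm *mm, size_t vpn)
{
    assert(vpn < TOY_PAGES);
    if (!(mm->pte[vpn] & PTE_V))
        return FAULT_GENUINE;            /* no valid mapping: do not retry */
    mm->tlb_cached[vpn] = false;         /* local flush of this one entry */
    return FAULT_SPURIOUS_RETRY;         /* the retried access will succeed */
}
```

Starting from a zeroed struct toy_mm, calling toy_speculative_fill() on a
page and then setting its PTE to PTE_V makes toy_access() fail once;
toy_kernel_fault() then returns FAULT_SPURIOUS_RETRY and the retried
access succeeds. A page whose PTE was never set returns FAULT_GENUINE.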

>
>
> For now, the simplest solution is to implement flush_cache_vmap()
> because riscv needs a sfence.vma when adding a new mapping, and if
It's not a local TLB flush; it would broadcast an IPI to all harts:
on_each_cpu(__ipi_flush_tlb_all, NULL, 1);

That's too horrible.

Some custom drivers or test code may care about it.
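
As a rough illustration of the difference, here is a small counting model
in C. It is only a sketch under assumptions, not a measurement:
TOY_NR_HARTS and all toy_* names are hypothetical, and it assumes that an
eager flush at vmap time interrupts every other hart while a fault replay
flushes only on the hart that took the fault.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_HARTS 4   /* hypothetical number of harts in the model */

struct toy_stats {
    unsigned long flushes[TOY_NR_HARTS]; /* TLB flushes executed per hart */
    unsigned long ipis;                  /* cross-hart interrupts sent */
};

/* Eager model: every new mapping flushes on all harts via IPI. */
static void toy_vmap_eager(struct toy_stats *s, size_t mapping_hart)
{
    assert(mapping_hart < TOY_NR_HARTS);
    for (size_t h = 0; h < TOY_NR_HARTS; h++) {
        if (h != mapping_hart)
            s->ipis++;                   /* remote hart must be interrupted */
        s->flushes[h]++;
    }
}

/* Lazy model: only the hart that took the spurious fault flushes. */
static void toy_fault_replay(struct toy_stats *s, size_t faulting_hart)
{
    assert(faulting_hart < TOY_NR_HARTS);
    s->flushes[faulting_hart]++;         /* local flush, no IPI */
}

/* Sum of flushes over all harts. */
static unsigned long toy_total_flushes(const struct toy_stats *s)
{
    unsigned long total = 0;
    for (size_t h = 0; h < TOY_NR_HARTS; h++)
        total += s->flushes[h];
    return total;
}
```

In this model, ten new mappings on a 4-hart system cost 30 IPIs and 40
flushes in the eager case, while the lazy case costs one local flush per
spurious fault actually taken. Real numbers depend on how often mappings
are created and how often a stale V=0 entry is really hit, so this does
not replace measuring.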

> that's a "performance killer", let's measure that and implement
> something like this patch is trying to do. I may be wrong, but there
> aren't many new kernel mappings that would require a call to
> flush_cache_vmap() so I disagree with the performance killer argument,
> but happy to be proven otherwise!

1. I agree with pre-allocating PGD entries. It's good for performance, but
not for Sv32.
2. We still need vmalloc_fault to meet the ISA spec requirements. (Some
vendors' microarchitectures, e.g., T-HEAD C910, can prevent V=0 entries
from entering the TLB during the page table walk, so they don't need it.)
3. Only if vmalloc_fault can't solve the problem should we consider
the flush_cache_vmap() solution.

>
> Thanks,
>
> Alex
>
>
> >
> >> +
> >>    #ifndef CONFIG_SMP
> >>
> >>    #define flush_icache_all() local_flush_icache_all()
> >>
> >>
> >> Let me know if that works for you!
> >>
> >>
> >>>
> > --
> > Best Regards
> >   Guo Ren

--
Best Regards
 Guo Ren