Sorry for the late reply, Alexandre. I'm busy with other suffs. On Mon, Jul 24, 2023 at 5:05 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > > > On 22/07/2023 01:59, Guo Ren wrote: > > On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > >> > >> On 21/07/2023 18:08, Guo Ren wrote: > >>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > >>>> On 21/07/2023 16:51, guoren@xxxxxxxxxx wrote: > >>>>> From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx> > >>>>> > >>>>> RISC-V specification permits the caching of PTEs whose V (Valid) > >>>>> bit is clear. Operating systems must be written to cope with this > >>>>> possibility, but implementers are reminded that eagerly caching > >>>>> invalid PTEs will reduce performance by causing additional page > >>>>> faults. > >>>>> > >>>>> So we must keep vmalloc_fault for the spurious page faults of kernel > >>>>> virtual address from an OoO machine. > >>>>> > >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx> > >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxx> > >>>>> --- > >>>>> arch/riscv/mm/fault.c | 3 +-- > >>>>> 1 file changed, 1 insertion(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > >>>>> index 85165fe438d8..f662c9eae7d4 100644 > >>>>> --- a/arch/riscv/mm/fault.c > >>>>> +++ b/arch/riscv/mm/fault.c > >>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs) > >>>>> * only copy the information from the master page table, > >>>>> * nothing more. > >>>>> */ > >>>>> - if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) && > >>>>> - unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) { > >>>>> + if (unlikely(addr >= TASK_SIZE)) { > >>>>> vmalloc_fault(regs, code, addr); > >>>>> return; > >>>>> } > >>>> Can you share what you are trying to fix here? > >>> We met a spurious page fault panic on an OoO machine. > >>> > >>> 1. The processor speculative execution brings the V=0 entries into the > >>> TLB in the kernel virtual address. > >>> 2. Linux kernel installs the kernel virtual address with the page, and V=1 > >>> 3. When kernel code access the kernel virtual address, it would raise > >>> a page fault as the V=0 entry in the tlb. > >>> 4. No vmalloc_fault, then panic. > >>> > >>>> I have a fix (that's currently running our CI) for commit 7d3332be011e > >>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that > >>>> implements flush_cache_vmap() since we lost the vmalloc_fault. > >>> Could you share that patch? > >> > >> Here we go: > >> > >> > >> Author: Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> > >> Date: Fri Jul 21 08:43:44 2023 +0000 > >> > >> riscv: Implement flush_cache_vmap() > >> > >> The RISC-V kernel needs a sfence.vma after a page table > >> modification: we > >> used to rely on the vmalloc fault handling to emit an sfence.vma, but > >> commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for > >> vmalloc/modules area") got rid of this path for 64-bit kernels, so > >> now we > >> need to explicitly emit a sfence.vma in flush_cache_vmap(). > >> > >> Note that we don't need to implement flush_cache_vunmap() as the > >> generic > >> code should emit a flush tlb after unmapping a vmalloc region. > >> > >> Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for > >> vmalloc/modules area") > >> Signed-off-by: Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> > >> > >> diff --git a/arch/riscv/include/asm/cacheflush.h > >> b/arch/riscv/include/asm/cacheflush.h > >> index 8091b8bf4883..b93ffddf8a61 100644 > >> --- a/arch/riscv/include/asm/cacheflush.h > >> +++ b/arch/riscv/include/asm/cacheflush.h > >> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page) > >> #define flush_icache_user_page(vma, pg, addr, len) \ > >> flush_icache_mm(vma->vm_mm, 0) > >> > >> +#ifdef CONFIG_64BIT > >> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end) > >> +#endif > > I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In > > addition, it would call IPI, which is a performance killer. > > > At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but > that needs to be fixed, see my last patchset > https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@xxxxxxxxxxxx/. > > But can you at least check that this fixes your issue? It would be > interesting to see if the problem comes from vmalloc or something else. It could solve my issue. > > > > What's the problem of spurious fault replay? It only costs a > > local_tlb_flush with vaddr. > > > We had this exact discussion internally this week, and the fault replay > seems like a solution. But that needs to be thought carefully: the > vmalloc fault was removed for a reason (see Bjorn commit log), tracing > functions can use vmalloc() in the path of the vmalloc fault, causing an > infinite trap loop. And here you are simply re-enabling this problem. Thx for mentioning it, and I will solve it in the next version of the patch: -static inline void vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr) +static void notrace vmalloc_fault(struct pt_regs *regs, int code, unsigned long addr) > In > addition, this patch makes vmalloc_fault() catch *all* kernel faults in > the kernel address space, so any genuine kernel fault would loop forever > in vmalloc_fault(). We check whether kernel vaddr is valid by the page_table, not range. I'm sure "the any genuine kernel fault would loop forever in vmalloc_fault()" is about what? Could you give an example? > > > For now, the simplest solution is to implement flush_cache_vmap() > because riscv needs a sfence.vma when adding a new mapping, and if It's not a local_tlb_flush, and it would ipi broadcast all harts. on_each_cpu(__ipi_flush_tlb_all, NULL, 1); That's too horrible. Some custom drivers or test codes may care about it. > that's a "performance killer", let's measure that and implement > something like this patch is trying to do. I may be wrong, but there > aren't many new kernel mappings that would require a call to > flush_cache_vmap() so I disagree with the performance killer argument, > but happy to be proven otherwise! 1. I agree to pre-allocate pgd entries. It's good for performance, but don't do that when Sv32. 2. We still need vmalloc_fault to match ISA spec requirements. (Some vendors' microarchitectures (e.g., T-HEAD c910) could prevent V=0 into TLB when PTW, then they don't need it.) 3. Only when vmalloc_fault can't solve the problem, then let's think about the flush_cache_vmap() solution. > > Thanks, > > Alex > > > > > >> + > >> #ifndef CONFIG_SMP > >> > >> #define flush_icache_all() local_flush_icache_all() > >> > >> > >> Let me know if that works for you! > >> > >> > >>> > > -- > > Best Regards > > Guo Ren -- Best Regards Guo Ren