On Wed, Jul 5, 2023 at 8:06 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > > + fixed Paul's address > > > On 05/07/2023 11:15, Guo Ren wrote: > > On Wed, Jul 5, 2023 at 3:01 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > >> > >> On 04/07/2023 04:25, Guo Ren wrote: > >>> On Mon, Jul 3, 2023 at 6:17 PM Alexandre Ghiti <alex@xxxxxxxx> wrote: > >>>> Hi Guo, > >>>> > >>>> On 29/06/2023 10:20, guoren@xxxxxxxxxx wrote: > >>>>> From: Guo Ren <guoren@xxxxxxxxxxxxxxxxx> > >>>>> > >>>>> The machine_kexec() uses set_memory_x to add the executable attribute to the > >>>>> page table entry of control_code_buffer. It only modifies the init_mm but not > >>>>> the current->active_mm. The current kexec process won't use init_mm directly, > >>>>> and it depends on minor_pagefault, which is removed by commit 7d3332be011e4 > >>>> Is the removal of minor_pagefault an issue? I'm not sure I understand > >>>> this part of the changelog. > >>> I use two different work-around patches to answer your question: > >>> 1st: > >>> ----- > >>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > >>> index 705d63a59aec..b8b200c81606 100644 > >>> --- a/arch/riscv/mm/fault.c > >>> +++ b/arch/riscv/mm/fault.c > >>> @@ -249,7 +249,7 @@ void handle_page_fault(struct pt_regs *regs) > >>> * only copy the information from the master page table, > >>> * nothing more. > >>> */ > >>> - if (unlikely((addr >= VMALLOC_START) && (addr < VMALLOC_END))) { > >>> + if (unlikely(addr >= 0x8000000000000000UL)) { > >>> vmalloc_fault(regs, code, addr); > >>> return; > >>> } > >>> ------ > >>> > >>> 2nd: > >>> ------ > >>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > >>> index 8e65f0a953e5..270f50852886 100644 > >>> --- a/arch/riscv/mm/init.c > >>> +++ b/arch/riscv/mm/init.c > >>> @@ -1387,7 +1387,7 @@ static void __init create_linear_mapping_page_table(void) > >>> if (end >= __pa(PAGE_OFFSET) + memory_limit) > >>> end = __pa(PAGE_OFFSET) + memory_limit; > >>> > >>> - create_linear_mapping_range(start, end, 0); > >>> + create_linear_mapping_range(start, end, PMD_SIZE); > >>> } > >>> > >>> #ifdef CONFIG_STRICT_KERNEL_RWX > >>> ----- > >>> > >>> The removal of minor_pagefault could be an issue, but in this case > >>> it's the VMALLOC_START/END which prevents the minor_pagefault at > >>> first. I didn't say commit 7d3332be011e4 is the problem. > >> > >> Sorry I still don't understand what you mean here and why you mention > >> the minor pagefault, could you explain again please? > >> > >> > >>>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") of 64BIT. So, > >>>>> when it met pud mapping on an MMU_SV39 machine, it caused the following: > >>>>> > >>>>> kexec_core: Starting new kernel > >>>>> Will call new kernel at 00300000 from hart id 0 > >>>>> FDT image at 747c7000 > >>>>> Bye... > >>>>> Unable to handle kernel paging request at virtual address ffffffda23b0d000 > >>>>> Oops [#1] > >>>>> Modules linked in: > >>>>> CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15 > >>>>> Hardware name: Sophgo Mango (DT) > >>>>> epc : 0xffffffda23b0d000 > >>>>> ra : machine_kexec+0xa6/0xb0 > >>>>> epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10 > >>>>> gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000 > >>>>> t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50 > >>>>> s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000 > >>>>> a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000 > >>>>> a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff > >>>>> s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000 > >>>>> s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 > >>>>> s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000 > >>>>> s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af > >>>>> t5 : ffffffff815351b0 t6 : ffffffc80c173b50 > >>>>> status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c > >>>>> > >>>>> Yes, Using set_memory_x API after boot has the limitation, and at least we > >>>>> should synchronize the current->active_mm to fix the problem. > >>>>> > >>>>> Fixes: d3ab332a5021 ("riscv: add ARCH_HAS_SET_MEMORY support") > >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxxxxxxxxx> > >>>>> Signed-off-by: Guo Ren <guoren@xxxxxxxxxx> > >>>>> --- > >>>>> arch/riscv/mm/pageattr.c | 7 +++++++ > >>>>> 1 file changed, 7 insertions(+) > >>>>> > >>>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c > >>>>> index ea3d61de065b..23d169c4ee81 100644 > >>>>> --- a/arch/riscv/mm/pageattr.c > >>>>> +++ b/arch/riscv/mm/pageattr.c > >>>>> @@ -123,6 +123,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, > >>>>> &masks); > >>>>> mmap_write_unlock(&init_mm); > >>>>> > >>>>> + if (current->active_mm != &init_mm) { > >>>>> + mmap_write_lock(current->active_mm); > >>>>> + ret = walk_page_range_novma(current->active_mm, start, end, > >>>>> + &pageattr_ops, NULL, &masks); > >>>>> + mmap_write_unlock(current->active_mm); > >>>>> + } > >>>>> + > >>>>> flush_tlb_kernel_range(start, end); > >>>>> > >>>>> return ret; > >>>> I don't understand: any page table inherits the entries of > >>>> swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is > >>>> "automatically" synchronized, so why should we synchronize one 4K entry > >>>> explicitly? A PGD entry would need to be synced, but not a PTE entry. > >>> The purpose of the second walk_page_range_novma() is for pgd's entries > >>> synchronization. I'm a bit lazy here, I agree, it's unnecessary to > >>> write lower level entries again. So I would use a simple pgd entries > >>> synchronization from vmalloc_fault in the next version of patch, all > >>> right? > >> > >> But vmalloc_fault was removed by commit 7d3332be011e4 for CONFIG_64BIT, > >> so I don't get it: why would we need to synchronize a PGD entry in your > >> case? Where does this new PGD come from? And the trap address is > >> ffffffda23b0d000, which lies in the direct mapping, so why do you > >> mention vmalloc_fault at all? > > The machine_kexec() uses set_memory_x to modify the direct mapping > > attributes from RW to RWX. But set_memory_x only changes the init_mm's > > attributes, not current->active_mm, so when kexec jumps into > > control_buffer, the instruction page fault happens, and there is no > > minor_pagefault for it, then panic. > > > > I found the bug on an MMU_sv39 machine, and the direct mapping used a > > 1GB PUD, the pgd entries. This patch could solve the problem by > > synchronizing a PGD entry between init_mm and current->active_mm. > > > Ok I get it now, thanks. So a few thoughts: > > - the trap address is in the linear mapping, so there won't be any sync > in the trap handler. Even if we implemented such behavior, waiting for a > fault to synchronize those mappings is error-prone (see commit > 7d3332be011e4) > > - this is Fixes: 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the > linear mapping") > > - that only happens if we use the largest mapping possible size for the > current mode: one simple solution is to restrict to the second largest > size (2MB for sv39, 1GB for sv48, 512GB for sv57). > > - your solution only works in this case, but we should fix the larger > problem: we need to sync all the page tables if such things happen, not > just the "calling" one. > > - set_memory* changing the protection of a whole PUD mapping for a > single page is a problem: if we set_memory_ro() a single page, we end up > with a 1GB linear mapping set to read only (which will obviously be > problematic!) > > - can we really use set_memory* on the direct mapping? Maybe we should > simply not, and the solution is to fix machine_kexec() so it remaps this > page outside the linear mapping. Okay, I agree to limit usage of set_memory*. And I would let machine_kexec() remaps control_buffer_page in the next version of patch instead of modifing linear mapping. > > - at the moment, we do not deal with the linear mapping in set_memory* > (for example, when used with modules, we protect the module mapping in > the vmalloc region but don't do anything with its linear mapping alias, > I have that on my todo list for a moment now). > > The best very short-term solution to me is to restrict to the second > largest size (easy to hack that in best_map_size()). > > And to me the long term solution is to split those mappings: that fixes > all the issues mentioned above + the thing with the modules (but I have > to dig a bit more, no other archs seems to do that). > > That's interesting, thanks for bringing that up, let me know what you > think, I'll keep digging! > > Alex > > > > > >> Sorry if I'm missing something, hope you can clarify things! > >> > >> Thanks, > >> > >> Alex > >> > >> > >>> > >>> -- > >>> Best Regards > >>> Guo Ren > >>> > >>> _______________________________________________ > >>> linux-riscv mailing list > >>> linux-riscv@xxxxxxxxxxxxxxxxxxx > >>> http://lists.infradead.org/mailman/listinfo/linux-riscv > > > > -- Best Regards Guo Ren