On Sat, Aug 14, 2021 at 8:59 AM Fabiano Rosas <farosas@xxxxxxxxxxxxx> wrote: > > Laurent Vivier <lvivier@xxxxxxxxxx> writes: > > > > > since this patch is merged my VM is experiencing a crash at boot (20% of the time): > > > > [ 8.496850] kernel tried to execute exec-protected page (c008000004073278) - exploit > > attempt? (uid: 0) > > [ 8.496921] BUG: Unable to handle kernel instruction fetch > > [ 8.496954] Faulting instruction address: 0xc008000004073278 > > [ 8.496994] Oops: Kernel access of bad area, sig: 11 [#1] > > [ 8.497028] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > > [ 8.497071] Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks xfs > > libcrc32c virtio_net net_failover virtio_blk vmx_crypto failover dm_mirror dm_region_hash > > dm_log dm_mod > > [ 8.497186] CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12 > > [ 8.497228] Workqueue: events control_work_handler [virtio_console] > > [ 8.497272] NIP: c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0 > > [ 8.497320] REGS: c00000002e4ef7e0 TRAP: 0400 Not tainted (5.14.0-rc4+) > > [ 8.497361] MSR: 800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002822 > > XER: 200400cf > > [ 8.497426] CFAR: c0000000001e9e44 IRQMASK: 1 > > [ 8.497426] GPR00: c008000004073278 c00000002e4efa80 c000000002a26b00 c000000042c39520 > > [ 8.497426] GPR04: 0000000000000001 0000000000000000 0000000000000000 00000000000000ff > > [ 8.497426] GPR08: 0000000000000001 c000000042c39520 0000000000000001 c008000004076008 > > [ 8.497426] GPR12: c0000000001e9de0 c0000001fffccb00 c00000000018ba88 c00000002c91d400 > > [ 8.497426] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > [ 8.497426] GPR20: 0000000000000000 0000000000000000 0000000000000000 c008000004080340 > > [ 8.497426] GPR24: c0080000040a01e8 0000000000000000 0000000000000000 c00000002e0975c0 > > [ 8.497426] GPR28: c00000002ce72940 c000000042c39520 0000000000000048 0000000000000038 > > [ 8.497891] NIP [c008000004073278] fill_queue+0xf0/0x210 [virtio_console] > > [ 8.497934] LR [c008000004073278] fill_queue+0xf0/0x210 [virtio_console] > > [ 8.497976] Call Trace: > > [ 8.497993] [c00000002e4efa80] [c00800000407323c] fill_queue+0xb4/0x210 > > [virtio_console] (unreliable) > > [ 8.498052] [c00000002e4efae0] [c008000004073a90] add_port+0x1a8/0x470 [virtio_console] > > [ 8.498102] [c00000002e4efbb0] [c0080000040750f4] control_work_handler+0xbc/0x1e8 > > [virtio_console] > > [ 8.498160] [c00000002e4efc60] [c00000000017f4f0] process_one_work+0x290/0x590 > > [ 8.498212] [c00000002e4efd00] [c00000000017f878] worker_thread+0x88/0x620 > > [ 8.498256] [c00000002e4efda0] [c00000000018bc14] kthread+0x194/0x1a0 > > [ 8.498299] [c00000002e4efe10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64 > > [ 8.498349] Instruction dump: > > [ 8.498374] 7da96b78 a14d0c8a 419c00b0 2f8a0000 419eff88 b32d0c8a 7c0004ac 4bffff7c > > [ 8.498430] 60000000 60000000 7fa3eb78 48002d95 <e8410018> 38600000 480025e1 e8410018 > > [ 8.498485] ---[ end trace 16ee10903290b647 ]--- > > [ 8.501433] > > [ 9.502601] Kernel panic - not syncing: Fatal exception > > > > add_port+0x1a8/0x470 : > > > > 1420 > > 1421 /* We can safely ignore ENOSPC because it means > > 1422 * the queue already has buffers. Buffers are removed > > 1423 * only by virtcons_remove(), not by unplug_port() > > 1424 */ > > ->1425 err = fill_queue(port->in_vq, &port->inbuf_lock); > > 1426 if (err < 0 && err != -ENOSPC) { > > 1427 dev_err(port->dev, "Error allocating inbufs\n"); > > 1428 goto free_device; > > 1429 } > > > > fill_queue+0x90/0x210 : > > > > 1326 static int fill_queue(struct virtqueue *vq, spinlock_t *lock) > > 1327 { > > 1328 struct port_buffer *buf; > > 1329 int nr_added_bufs; > > 1330 int ret; > > 1331 > > 1332 nr_added_bufs = 0; > > 1333 do { > > 1334 buf = alloc_buf(vq->vdev, PAGE_SIZE, 0); > > 1335 if (!buf) > > 1336 return -ENOMEM; > > 1337 > > ->1338 spin_lock_irq(lock); > > > > I'm using an upstream kernel (5.14-rc4, 251a1524293d) in the VM. > > > > My host is a RHEL 8.5/POWER9: qemu-kvm-6.0.0-21 and kernel-4.18.0-325 > > > > My qemu command line is: > > > > /usr/libexec/qemu-kvm \ > > -M pseries,accel=kvm \ > > -nographic -nodefaults \ > > -device virtio-serial-pci \ > > -device virtconsole \ > > -device virtio-net-pci,mac=9a:2b:2c:2d:2e:2f,netdev=hostnet0 \ > > -blockdev > > node-name=disk1,file.driver=file,driver=qcow2,file.driver=file,file.filename=disk.qcow2 \ > > -netdev bridge,id=hostnet0,br=virbr0,helper=/usr/libexec/qemu-bridge-helper \ > > -device virtio-blk-pci,id=image1,drive=disk1 \ > > -m 8192 \ > > -smp 4 \ > > -serial mon:stdio > > > > > > Do we need something in qemu/kvm to support STRICT_MODULE_RWX ? > > > > Thanks, > > Laurent > > This patch survived 300 consecutive runs without the issue: > > diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c > index 0876216ceee6..7aeb2b62e5dc 100644 > --- a/arch/powerpc/mm/pageattr.c > +++ b/arch/powerpc/mm/pageattr.c > @@ -35,10 +35,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) > pte_t pte; > > spin_lock(&init_mm.page_table_lock); > - > - /* invalidate the PTE so it's safe to modify */ > - pte = ptep_get_and_clear(&init_mm, addr, ptep); > - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + pte = *ptep; > > /* modify the PTE bits as desired, then apply */ > switch (action) { > @@ -60,10 +57,7 @@ static int change_page_attr(pte_t *ptep, unsigned long addr, void *data) > } > > set_pte_at(&init_mm, addr, ptep, pte); > - > - /* See ptesync comment in radix__set_pte_at() */ > - if (radix_enabled()) > - asm volatile("ptesync": : :"memory"); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > spin_unlock(&init_mm.page_table_lock); > > return 0; > --- > > What I think is happening is that the virtio_console code is running > at the same time we are doing the `module_enable_ro` at the end of > `do_init_module` due to the async nature of the work handler. There is > a window after the TLB flush when the PTE has its permission bits > cleared, so any translations of the module code page attempted during > that window will fault. Thanks for looking at this. I agree that this is the problem. This avoids the crash (not a proper solution): --- a/kernel/module.c +++ b/kernel/module.c @@ -2011,13 +2011,15 @@ static void module_enable_ro(const struct module *mod, bool after_init) if (!rodata_enabled) return; - set_vm_flush_reset_perms(mod->core_layout.base); - set_vm_flush_reset_perms(mod->init_layout.base); - frob_text(&mod->core_layout, set_memory_ro); - - frob_rodata(&mod->core_layout, set_memory_ro); - frob_text(&mod->init_layout, set_memory_ro); - frob_rodata(&mod->init_layout, set_memory_ro); + if (!after_init) { + set_vm_flush_reset_perms(mod->core_layout.base); + set_vm_flush_reset_perms(mod->init_layout.base); + frob_text(&mod->core_layout, set_memory_ro); + + frob_rodata(&mod->core_layout, set_memory_ro); + frob_text(&mod->init_layout, set_memory_ro); + frob_rodata(&mod->init_layout, set_memory_ro); + } > > I'm ignorant of strict rwx in general so I don't see why we need to > clear the bits before setting them to their final value, but as I > understand it, the set_pte_at + flush_tlb_kernel_range satisfy the ISA > requirement of [ptesync; tlbie; eieio; tlbsync; ptesync;] so it seems > like the patch should work. I believe it's done like that because ISA 6.10.1.2 Modifying a Translation Table Entry - "The sequence is equivalent to deleting the PTE and then adding a new one". In that sequence, I think it is [set pte to 0; ptesync; tlbie; eieio; tlbsync; ptesync;] But it does seem like we need to do something like your patch for change_page_attr() to work as expected. > > Now, I cannot explain why the crash always happens around the code > that does the module's symbols relocation (the NIP in Laurent's trace > is the TOC reload from module_64.c:restore_r2). Maybe because > instructions are already in icache until the first branch into the > stub? Yeah, mpe thought it might be some cache effect of a branch instruction. >From time to time I did see Oopses like this (not "kernel tried to execute exec-protected page") [ 124.986964][ T52] Oops: Kernel access of bad area, sig: 11 [#1] [ 124.987043][ T52] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries [ 124.987095][ T52] Modules linked in: virtio_console binfmt_misc virtiofs fuse virtio_net virtio_blk virtio_pci virtio_pci_modern_dev virtio_ring virtio crc32c_vpmsum [last unloaded: virtio_console] [ 124.987209][ T52] CPU: 2 PID: 52 Comm: kworker/2:1 Not tainted 5.14.0-rc4 #4 [ 124.987259][ T52] Workqueue: events control_work_handler [virtio_console] [ 124.987307][ T52] NIP: c008000001a86044 LR: c008000001a82cf8 CTR: c00000000042a6c0 [ 124.987358][ T52] REGS: c00000000b273770 TRAP: 0300 Not tainted (5.14.0-rc4) [ 124.987406][ T52] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 82002882 XER: 00000004 [ 124.987475][ T52] CFAR: c008000001a82cf4 DAR: c008000001a86058 DSISR: 40000000 IRQMASK: 0 [ 124.987475][ T52] GPR00: c008000001a82ce8 c00000000b273a10 c008000001ab8000 c000000010731320 [ 124.987475][ T52] GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000000 [ 124.987475][ T52] GPR08: 0000000000000000 0000000000000000 0000000000000000 c008000001a86038 [ 124.987475][ T52] GPR12: 0000000022002884 c0000001ffffdf00 c000000000169ae8 c0000000037700c0 [ 124.987475][ T52] GPR16: 0000000000000000 0000000000000000 c000000021af0000 c000000003313900 [ 124.987475][ T52] GPR20: 0000000000000001 0000000000000000 0000000000000000 c000000003316780 [ 124.987475][ T52] GPR24: c008000001a90288 0000000000000000 0000000000000000 c000000010731320 [ 124.987475][ T52] GPR28: c000000003316900 0000000000000018 0000000000000068 c00000000379c5a0 [ 124.988028][ T52] NIP [c008000001a86044] fini+0x824/0xa7e0 [virtio_console] [ 124.988085][ T52] LR [c008000001a82cf8] fill_queue+0xb0/0x230 [virtio_console] [ 124.988141][ T52] Call Trace: [ 124.988171][ T52] [c00000000b273a10] [c008000001a82ce8] fill_queue+0xa0/0x230 [virtio_console] (unreliable) [ 124.988246][ T52] [c00000000b273aa0] [c008000001a83208] add_port.isra.0+0x1a0/0x4b0 [virtio_console] [ 124.988312][ T52] [c00000000b273b70] [c008000001a85474] control_work_handler+0x46c/0x654 [virtio_console] [ 124.988403][ T52] [c00000000b273c70] [c00000000015ce60] process_one_work+0x2a0/0x570 [ 124.988586][ T52] [c00000000b273d10] [c00000000015d1d8] worker_thread+0xa8/0x660 [ 124.988684][ T52] [c00000000b273da0] [c000000000169c5c] kthread+0x17c/0x190 [ 124.988762][ T52] [c00000000b273e10] [c00000000000cf54] ret_from_kernel_thread+0x5c/0x64 [ 124.988916][ T52] Instruction dump: [ 124.988965][ T52] 396be010 f8410018 e98b0020 7d8903a6 4e800420 00000000 73747562 003a0300 [ 124.989152][ T52] c0000000 3d62fffd 396be038 f8410018 <e98b0020> 7d8903a6 4e800420 00000000 [ 124.989298][ T52] ---[ end trace ab9046c024eb3154 ]--- I guess it depends on the timing of which pte is invalidated. > > Anyway, this is what Murilo and I found out over our debugging session > in the past couple of days. I hope it helps. =) Yes it does, thanks heaps. >