On Thu, 24 May 2018 09:12:09 +1000
Paul Mackerras <paulus@xxxxxxxxxx> wrote:

> On Wed, May 23, 2018 at 07:04:21PM +0200, Greg Kurz wrote:
> > On Sat, 19 May 2018 15:56:38 +1000
> > Paul Mackerras <paulus@xxxxxxxxxx> wrote:
> >
> > > This relaxes the restriction on using PR KVM on POWER9. The existing
> > > code does work inside a guest partition running in HPT mode, because
> > > hypercalls such as H_ENTER use the old HPTE format, not the new
> > > format used by POWER9, and so no change to PR KVM's HPT manipulation
> > > code is required. PR KVM will still refuse to run if the kernel is
> > > using radix translation or if it is running bare-metal.
> > >
> > > Signed-off-by: Paul Mackerras <paulus@xxxxxxxxxx>
> > > ---
> >
> > Paul,
> >
> > I have built a 4.16.0 kernel + this patch and booted the L1 guest
> > with "disable_radix=on". I could then successfully boot a L2 guest,
> > using the same kernel for simplicity. Both guests using identical
> > fedora28 images. So it seems to be working at first sight.
> >
> >
> > But, if I boot the L2 guest with the default fedora28 kernel, ie
> > 4.16.9-300.fc28.ppc64le, the L2 guest hangs.
> >
> > OF stdout device is: /vdevice/vty@71000000
> > Preparing to boot Linux version 4.16.9-300.fc28.ppc64le (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.1.1 20180502 (Red Hat 8.1.1-1) (GCC)) #1 SMP Thu May 17 04:31:32 UTC 2018
> > Detected machine type: 0000000000000101
> > command line: BOOT_IMAGE=/boot/vmlinuz-4.16.9-300.fc28.ppc64le root=UUID=22128c5c-30b1-4e0a-ac16-95853df31131 ro rhgb console=hvc0 early_printk LANG=en_US.UTF-8
> > Max number of cores passed to firmware: 1024 (NR_CPUS = 1024)
> > Calling ibm,client-architecture-support... done
> > memory layout at init:
> >   memory_limit : 0000000000000000 (16 MB aligned)
> >   alloc_bottom : 0000000004e70000
> >   alloc_top    : 0000000030000000
> >   alloc_top_hi : 0000000100000000
> >   rmo_top      : 0000000030000000
> >   ram_top      : 0000000100000000
> > instantiating rtas at 0x000000002fff0000... done
> > prom_hold_cpus: skipped
> > copying OF device tree...
> > Building dt strings...
> > Building dt structure...
> > Device tree strings 0x0000000004e80000 -> 0x0000000004e80aaf
> > Device tree struct  0x0000000004e90000 -> 0x0000000004ea0000
> > Quiescing Open Firmware ...
> > Booting Linux via __start() @ 0x0000000002000000 ...
> >
> > (qemu) p $pc
> > 0xc000000000026aa0
> > (qemu) p $lr
> > 0xc000000000119ff4
> >
> > # addr2line -e /usr/lib/debug/lib/modules/4.16.9-300.fc28.ppc64le/vmlinux 0xc000000000026aa0
> > /usr/src/debug/kernel-4.16.fc28/linux-4.16.9-300.fc28.ppc64le/./arch/powerpc/include/asm/time.h:115
> >
> > # addr2line -e /usr/lib/debug/lib/modules/4.16.9-300.fc28.ppc64le/vmlinux 0xc000000000119ff4
> > /usr/src/debug/kernel-4.16.fc28/linux-4.16.9-300.fc28.ppc64le/kernel/panic.c:300
> >
> > ie, the final mdelay(PANIC_TIMER_STEP) in panic().
> >
> > Not sure how to debug this further, any suggestion is welcome :)
>
> I suggest you find the address of log_buf from System.map, read that
> via the qemu command line (log_buf is a pointer), then dump the memory
> it points to, so you can see the panic message.
>

Hi Paul,

Thanks for your suggestion. I could reproduce the problem when booting the
L2 guest with an upstream kernel (commit d7b66b4ab034).
I've tried to dump the log_buf but things didn't go well:

$ grep 'd log_buf' System.map
c000000001304f08 d log_buf_len
c000000001304f10 d log_buf

(qemu) x 0xc000000001304f08
c000000001304f08: Cannot access memory

Since 4.16.0 works, I could bisect down to:

commit dbfcf3cb9c681aa0c5d0bb46068f98d5b1823dd3
Author: Paul Mackerras <paulus@xxxxxxxxxx>
Date:   Thu Feb 16 16:03:39 2017 +1100

    powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9

The hcall is handled by QEMU, which then calls the KVM_PPC_CONFIGURE_V3_MMU
ioctl. The ioctl fails since PR KVM doesn't implement it, and so
H_REGISTER_PROC_TBL fails with H_PARAMETER.

The panic hence comes from...

static int pseries_lpar_register_process_table(unsigned long base,
			unsigned long page_size, unsigned long table_size)
{
.
.
.
	for (;;) {
		rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
					page_size, table_size);
		if (!H_IS_LONG_BUSY(rc))
			break;
		mdelay(get_longbusy_msecs(rc));
	}
	if (rc != H_SUCCESS) {
		pr_err("Failed to register process table (rc=%ld)\n", rc);
		BUG();
		^^^
		here.

The changelog of commit dbfcf3cb9c68 reads:

"   If the hypervisor is able to support both radix and HPT guests, it
    would be entitled to defer allocation of the HPT until the
    H_REGISTER_PROC_TBL call"

But in our case, the hypervisor is QEMU/PR KVM in a L1 guest booted with
radix disabled. It is hence not "entitled to defer allocation of the HPT",
and QEMU allocates one during initial machine reset.

If I patch QEMU to make H_REGISTER_PROC_TBL a nop when KVM_CAP_PPC_MMU_RADIX
returns 0, then the L2 kernel boots like a charm. So I'm wondering if the
guest should even call H_REGISTER_PROC_TBL in this case, since there's
nothing to do?

Also, peeking into PAPR, I see that H_REGISTER_PROC_TBL is mandatory only
"If the platform supports the In-Memory Table Translation Option", which
isn't the case here. This is supposed to be advertised through the
"hcall-imtt" function set in the OF property "ibm,hypertas-functions" in
the /rtas node.
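
To make that check concrete, here is a minimal userspace sketch of looking
up "hcall-imtt" in a string-list property, i.e. a buffer of consecutive
NUL-terminated strings. This is only an illustration of the idea, not what
the kernel or QEMU does today: hypertas_has_function() and
should_register_process_table() are hypothetical names, and the caller is
assumed to have already fetched the property bytes from the device tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: scan a property value laid out as consecutive
 * NUL-terminated strings for an exact match on 'name'.
 * 'prop' and 'len' are assumed to come from "ibm,hypertas-functions".
 */
static bool hypertas_has_function(const char *prop, size_t len,
				  const char *name)
{
	size_t off = 0;

	assert(name != NULL);
	if (prop == NULL)
		return false;

	while (off < len) {
		const char *entry = prop + off;
		const char *end = memchr(entry, '\0', len - off);

		/* Unterminated trailing entry: treat as malformed, stop. */
		if (end == NULL)
			break;
		if (strcmp(entry, name) == 0)
			return true;
		off += (size_t)(end - entry) + 1;
	}
	return false;
}

/*
 * Hypothetical policy: only issue H_REGISTER_PROC_TBL on an ISA v3.00
 * CPU when the platform advertises the "hcall-imtt" function set.
 */
static bool should_register_process_table(bool is_arch_300,
					  const char *prop, size_t len)
{
	return is_arch_300 && hypertas_has_function(prop, len, "hcall-imtt");
}
```

With a property that lacks "hcall-imtt", should_register_process_table()
returns false even on POWER9, which is the behaviour that would have
avoided the BUG() above.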

I guess a correct behavior would be for QEMU to advertise "hcall-imtt" when
it supports both radix and hash, and the kernel should only call
H_REGISTER_PROC_TBL if it is available. Of course, neither QEMU nor the
kernel seems to care about "hcall-imtt" today... so I guess the easier way
is to fix H_REGISTER_PROC_TBL in QEMU.

> Another thing to try would be to do the same test on a POWER8.
>

No surprise, it continues to work on a POWER8, since:

	/*
	 * On POWER9, we need to do a H_REGISTER_PROC_TBL hcall
	 * to inform the hypervisor that we wish to use the HPT.
	 */
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		register_process_table(0, 0, 0);

> Paul.

Cheers,

--
Greg