On Sun, Jun 12, 2022 at 09:20:29AM -0700, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 10:59:30PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
> > Hi, Paul
> >
> > On 2022/6/12 9:36 PM, Paul E. McKenney wrote:
> > > On Sun, Jun 12, 2022 at 03:40:30PM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
> > > > Hi, Paul
> > > >
> > > > On 2022/6/12 12:59 AM, Paul E. McKenney wrote:
> > > > > On Sun, Jun 12, 2022 at 12:32:59AM +0800, Zhangfei Gao wrote:
> > > > > > Hi, Paul
> > > > > >
> > > > > > When verifying qemu with the ACPI RMR feature on v5.19-rc1, the
> > > > > > guest kernel gets stuck for several minutes.
> > > > > Stuck for several minutes but then continues normally?  Or stuck for
> > > > > several minutes before you kill qemu?
> > > > qemu boot is stuck for several minutes, then the guest can boot up
> > > > normally, just slower.
> > > > > And I have to ask...  What happened without the ACPI RMR feature?
> > > > Without ACPI, qemu boots quickly with no stall:
> > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > virt,gic-version=3,iommu=smmuv3 \
> > > > -enable-kvm -cpu host -m 1024 \
> > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off"
> > > >
> > > > Adding acpi=force & -bios QEMU_EFI.fd, qemu boot is stuck for several
> > > > minutes.
> > > >
> > > > By the way, my hardware platform is aarch64.
> > > Thank you for the information!  The problem is excessive delay rather
> > > than a hang, and it is configuration-dependent.  Good to know!
> > >
> > > > Only this change resolves the stall:
> > > >
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -524,6 +524,10 @@ static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> > > >  {
> > > >  	unsigned long jbase = SRCU_INTERVAL;
> > > >
> > > > +	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
> > > > +			 READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> > > > +		return 0;
> > > > +	return SRCU_INTERVAL;
> > > I am glad that you have a workaround for this problem, but this change
> > > would re-introduce the problem that commit 282d8998e997 ("srcu: Prevent
> > > expedited GPs and blocking readers from consuming CPU") was intended
> > > to fix.  For one example, your change can prevent kernel live patching
> > > from applying a patch.  So something else is needed.
> > Understood, it was just to debug where the issue is.
> >
> > > Does changing the value of SRCU_MAX_INTERVAL to (say) 3 decrease the delay
> > > significantly?  (This is not a fix, either, but instead a debug check.)
> > No use.
>
> OK, that indicates that you have a very large number of invocations
> of synchronize_srcu() or synchronize_srcu_expedited() instead of only
> a few that take a very long time each.
>
> > > Your change always returns zero if another SRCU grace period is needed.
> > > Let's look at the callers of srcu_get_delay():
> > >
> > > o	cleanup_srcu_struct() uses it to check whether there is an
> > >	expedited grace period pending, leaking the srcu_struct if so.
> > >	This should not affect boot delay.  (Unless you are invoking
> > >	init_srcu_struct() and cleanup_srcu_struct() really really
> > >	often.)
> > >
> > > o	srcu_gp_end() uses it to determine whether or not to allow
> > >	a one-jiffy delay before invoking callbacks at the end of
> > >	a grace period.
> > >
> > > o	srcu_funnel_gp_start() uses it to determine whether or not to
> > >	allow a one-jiffy delay before starting the process of checking
> > >	for the end of an SRCU grace period.
> > >
> > > o	try_check_zero() uses it to add an additional short delay
> > >	(instead of a long delay) between checks of reader state.
> > >
> > > o	process_srcu() uses it to calculate the long delay between
> > >	checks of reader state.
> > >
> > > These add one-jiffy delays, except for process_srcu(), which adds a delay
> > > of up to 10 jiffies.  Even given HZ=100 (as opposed to the HZ=1000 that
> > > I normally use), this requires thousands of such delays to add up to the
> > > several minutes that you are seeing.  (In theory, the delays could also
> > > be due to SRCU readers, except that in that case adjusting timeouts in
> > > the grace-period processing would not make things go faster.)
> > >
> > > So, does acpi=force & -bios QEMU_EFI.fd add SRCU grace periods?  If so,
> > > it would be very good to make sure that this code is using SRCU
> > > efficiently.  One way to check would be to put a printk() into
> > > synchronize_srcu(), though maintaining a counter and printing (say)
> > > every 1000th invocation might be easier on the console output.
> >
> > Good idea.
>
> Glad you like it.  ;-)
>
> > > > > > And on 5.18, there is no such problem.
> > > > > >
> > > > > > After reverting this patch, the issue is solved:
> > > > > > Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking
> > > > > > readers from consuming CPU")
> > > > > >
> > > > > > qemu cmd:
> > > > > > build/aarch64-softmmu/qemu-system-aarch64 -machine
> > > > > > virt,gic-version=3,iommu=smmuv3 \
> > > > > > -enable-kvm -cpu host -m 1024 \
> > > > > > -kernel Image -initrd mini-rootfs.cpio.gz -nographic -append \
> > > > > > "rdinit=init console=ttyAMA0 earlycon=pl011,0x9000000 kpti=off acpi=force" \
> > > > > > -bios QEMU_EFI.fd
> > > > > >
> > > > > > log:
> > > > > > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > > > > > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > > > > > 0x75459000
> > > > > > Loading driver at 0x00075458000 EntryPoint=0x00075459058 IScsiDxe.efi
> > > > > > InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7AA4DE98
> > > > > > ProtectUefiImageCommon - 0x7AA4D040
> > > > > >   - 0x0000000075458000 - 0x000000000003F000
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075458000 - 0x0000000000001000 (0x0000000000004008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075459000 - 0x000000000003B000 (0x0000000000020008)
> > > > > > SetUefiImageMemoryAttributes - 0x0000000075494000 - 0x0000000000003000 (0x0000000000004008)
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952C8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 18A031AB-B443-4D1A-A5C0-0C09261E9F71 754952F8
> > > > > > InstallProtocolInterface: 107A772C-D5E1-11D4-9A46-0090273FC14D 75495358
> > > > > > InstallProtocolInterface: 6A7A5CFF-E8D9-4F70-BADA-75AB3025CE14 75495370
> > > > > > InstallProtocolInterface: 59324945-EC44-4C0D-B1CD-9DB139DF070C 75495348
> > > > > > InstallProtocolInterface: 09576E91-6D3F-11D2-8E39-00A0C969723B 754953E8
> > > > > > InstallProtocolInterface: 330D4706-F2A0-4E4F-A369-B66FA8D54385 7AA4D728
> > > > > >
> > > > > > Not sure whether it has already been reported or solved.
> > > > > This is the first I have heard of it, so thank you for reporting it.
> > > > >
> > > > > Do you have a way of collecting something like sysrq-t output?
> > > > Do you mean "echo t > /proc/sysrq-trigger"?
> > > > There is too much output and the kernel dump cannot stop.
> > > OK.  What other tools do you have to work out what is happening during
> > > temporary hangs such as this one?
> > >
> > > The question to be answered: "Is there usually at least one task waiting
> > > in synchronize_srcu() during these hangs, and if so, which srcu_struct
> > > is passed to those synchronize_srcu() calls?"
> >
> > As you suggested, I added a print in __synchronize_srcu(), one print
> > per 1000 calls.
> >
> > With acpi=force & -bios QEMU_EFI.fd:
> >
> > When qemu is stuck at
> > InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B 7AA4D040
> > add-symbol-file /home/linaro/work/edk2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC48/AARCH64/NetworkPkg/IScsiDxe/IScsiDxe/DEBUG/IScsiDxe.dll
> > 0x75459000
> >
> > the print in __synchronize_srcu() counts from 0 to 9001.
>
> Now that is what I call a large number of calls!
>
> > [   94.271350] gzf __synchronize_srcu loop=1001
> > ....
> >
> > [  222.621659] __synchronize_srcu loop=9001
> > [  222.621664] CPU: 96 PID: 2294 Comm: qemu-system-aar Not tainted 5.19.0-rc1-15071-g697f40b5235f-dirty #615
> > [  222.621666] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD, BIOS 2280-V2 CS V5.B133.01 03/25/2021
> > [  222.621667] Call trace:
> > [  222.621668]  dump_backtrace+0xe4/0xf0
> > [  222.621670]  show_stack+0x20/0x70
> > [  222.621672]  dump_stack_lvl+0x8c/0xb8
> > [  222.621674]  dump_stack+0x18/0x34
> > [  222.621676]  __synchronize_srcu+0x120/0x128
> > [  222.621678]  synchronize_srcu_expedited+0x2c/0x40
> > [  222.621680]  kvm_swap_active_memslots+0x130/0x198
> > [  222.621683]  kvm_activate_memslot+0x40/0x68
> > [  222.621684]  kvm_set_memslot+0x184/0x3b0
> > [  222.621686]  __kvm_set_memory_region+0x288/0x438
> > [  222.621688]  kvm_set_memory_region+0x3c/0x60
>
> This is KVM setting up one mapping in your IORT RMR memory, correct?
> (As in I/O remapping table reserved memory regions.)
>
> > [  222.621689]  kvm_vm_ioctl+0x5a0/0x13e0
>
> And this ioctl() is qemu telling KVM to do so, correct?
>
> > [  222.621691]  __arm64_sys_ioctl+0xb0/0xf8
> > [  222.621693]  invoke_syscall+0x4c/0x110
> > [  222.621696]  el0_svc_common.constprop.0+0x68/0x128
> > [  222.621698]  do_el0_svc+0x34/0xc0
> > [  222.621701]  el0_svc+0x30/0x98
> > [  222.621704]  el0t_64_sync_handler+0xb8/0xc0
> > [  222.621706]  el0t_64_sync+0x18c/0x190
> >
> > Without acpi=force, there is no print at all (still printing once per
> > 1000 calls).
>
> OK, this certainly explains the slowdown, both from adding the IORT RMR
> and from the commit that you bisected to.  The answers to a few questions
> might help illuminate a path towards a fix:
>
> Do these reserved memory regions really need to be allocated separately?
> (For example, are they really all non-contiguous?  If not, that is, if
> there are a lot of contiguous memory regions, could you sort the IORT
> by address and do one ioctl() for each set of contiguous memory regions?)
>
> Are all of these reserved memory regions set up before init is spawned?
>
> Are all of these reserved memory regions set up while there is only a
> single vCPU up and running?
>
> Is the SRCU grace period really needed in this case?  (I freely confess
> to not being all that familiar with KVM.)

Oh, and there was a similar many-requests problem with networking many
years ago.  This was solved by adding a new syscall/ioctl()/whatever
that permitted many requests to be presented to the kernel with a
single system call.

Could a new ioctl() be introduced that requested a large number of these
memory regions in one go so as to make each call to
synchronize_srcu_expedited() cover a useful fraction of your 9000+
requests?

Adding a few of the KVM guys on CC for their thoughts.

							Thanx, Paul