Hi,

Issue observed:
---------------
Running iperf traffic on a guest 10G interface (VFIO assigned device), followed by a KVM guest poweroff/shutdown, results in the following warning on the Cavium arm64 platform.

Stack trace:

[ 1151.424692] ------------[ cut here ]------------
[ 1151.429318] WARNING: CPU: 30 PID: 6744 at kernel/rcu/srcu.c:251 cleanup_srcu_struct+0xb4/0xd0
[ 1151.437832] Modules linked in: fuse cavium_rng_vf rng_core cavium_rng ipv6 btrfs xor zlib_deflate raid6_pq
[ 1151.448978] CPU: 30 PID: 6744 Comm: qemu-system-aar Tainted: G W 4.11.2 #1
[ 1151.456970] Hardware name: Gigabyte gbt-mt60/gbt-mt60, BIOS 0.3 Mar 31 2017
[ 1151.463921] task: fffffe1f994a3400 task.stack: fffffe1f99590000
[ 1151.469831] PC is at cleanup_srcu_struct+0xb4/0xd0
[ 1151.474612] LR is at cleanup_srcu_struct+0x80/0xd0
[ 1151.479392] pc : [<fffffc000811698c>] lr : [<fffffc0008116958>] pstate: 60000145
[ 1151.486776] sp : fffffe1f99593b20
[ 1151.490081] x29: fffffe1f99593b20 x28: fffffe1f994a3400
[ 1151.495385] x27: 0000000000000008 x26: fffffe1f99593de8
[ 1151.500689] x25: dead000000000100 x24: dead000000000200
[ 1151.505993] x23: fffffc0008e93f24 x22: fffffffffffffff9
[ 1151.511296] x21: ffffff000d93f0f0 x20: fffffc0008e93bc8
[ 1151.516601] x19: fffffc0008e93e18 x18: 000000000000003f
[ 1151.521904] x17: 000000000000003f x16: 000000000000000d
[ 1151.527208] x15: 0000000000000010 x14: 0000000000000000
[ 1151.532512] x13: 0000000000000040 x12: 0000000000000020
[ 1151.537816] x11: 0000000000000020 x10: 0101010101010101
[ 1151.543120] x9 : 0000000040000000 x8 : 0000000000210d00
[ 1151.548424] x7 : fefefeff71647274 x6 : 0000000000000000
[ 1151.553728] x5 : 0000000000000000 x4 : 0000000000000000
[ 1151.559032] x3 : 0000000000000000 x2 : 0000000000000040
[ 1151.564335] x1 : 0000000000000040 x0 : 0000000000000040
[ 1151.571119] ---[ end trace ff1987e1b5556fbe ]---
[ 1151.575726] Call trace:
[ 1151.578163] Exception stack(0xfffffe1f99593950 to 0xfffffe1f99593a80)
[ 1151.584594] 3940: fffffc0008e93e18 0000040000000000
[ 1151.592414] 3960: fffffe1f99593b20 fffffc000811698c fffffe1f99593980 fffffc000839cc28
[ 1151.600233] 3980: fffffe1f995939d0 fffffc00081344cc fffffc000896b000 fffffe1f99593ad8
[ 1151.608053] 39a0: fffffe1f99593ab0 fffffc00081d2a5c fffffdffc7d95ac0 ffffff1f656b1a00
[ 1151.615873] 39c0: fffffe1f99593ad0 fffffc00081d2a5c fffffdff87e75e00 fffffe1f9d780000
[ 1151.623692] 39e0: fffffe1f994a3400 000000000000051e 0000000000000040 0000000000000040
[ 1151.631511] 3a00: 0000000000000040 0000000000000000 0000000000000000 0000000000000000
[ 1151.639330] 3a20: 0000000000000000 fefefeff71647274 0000000000210d00 0000000040000000
[ 1151.647150] 3a40: 0101010101010101 0000000000000020 0000000000000020 0000000000000040
[ 1151.654970] 3a60: 0000000000000000 0000000000000010 000000000000000d 000000000000003f
[ 1151.662789] [<fffffc000811698c>] cleanup_srcu_struct+0xb4/0xd0
[ 1151.668614] [<fffffc000809c930>] kvm_put_kvm+0x1e0/0x238
[ 1151.673916] [<fffffc000809c9f8>] kvm_vm_release+0x20/0x30
[ 1151.679308] [<fffffc00081f0c64>] __fput+0x8c/0x1d0
[ 1151.684089] [<fffffc00081f0e0c>] ____fput+0xc/0x18
[ 1151.688872] [<fffffc00080d5830>] task_work_run+0xc0/0xe0
[ 1151.694176] [<fffffc00080bf194>] do_exit+0x2c4/0x978
[ 1151.699131] [<fffffc00080bf8ac>] do_group_exit+0x34/0x98
[ 1151.704435] [<fffffc00080c9118>] get_signal+0x1e0/0x4c0
[ 1151.709653] [<fffffc00080863d0>] do_signal+0xb8/0x4d8
[ 1151.714694] [<fffffc0008086a08>] do_notify_resume+0x88/0xa8
[ 1151.720256] [<fffffc0008082ad0>] work_pending+0x8/0x10
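For reference, the warning comes from cleanup_srcu_struct() finding srcu_readers_active() still true when the VM is torn down. A condensed sketch of that check, based on my reading of the 4.11 classic SRCU code in kernel/rcu/srcu.c (abbreviated, so treat the details as approximate):

/* Condensed from 4.11 kernel/rcu/srcu.c (approximate): readers are
 * considered active while the summed per-CPU lock counts do not
 * balance the summed unlock counts across both array indexes. */
static bool srcu_readers_active(struct srcu_struct *sp)
{
        int cpu;
        unsigned long sum = 0;

        for_each_possible_cpu(cpu) {
                struct srcu_array *cpuc = per_cpu_ptr(sp->per_cpu_ref, cpu);

                sum += READ_ONCE(cpuc->lock_count[0]);
                sum += READ_ONCE(cpuc->lock_count[1]);
                sum -= READ_ONCE(cpuc->unlock_count[0]);
                sum -= READ_ONCE(cpuc->unlock_count[1]);
        }
        return sum;     /* nonzero here triggers the WARN in cleanup_srcu_struct() */
}

A single lost lock-side increment means this sum can never return to zero, so the WARN fires when the last reference to the VM is dropped.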
Analysis
--------
Additional prints in srcu_readers_active() indicate that the per-CPU variable lock_count (of struct srcu_struct) lags behind the unlock count, hence the warning.

In the KVM irq injection path, irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context. A worker thread, irqfd_inject(), also calls srcu_read_lock() (inside kvm_set_irq()) in process context. This can race when incrementing lock_count, since __this_cpu_inc() is a plain, non-atomic read-modify-write: if the interrupt arrives on the same CPU between the load and the store of the worker's increment, one of the two increments is lost. Replacing __this_cpu_inc() with this_cpu_inc() (the interrupt-safe version) works around the problem, which backs up this analysis.

Possible solution
-----------------
One way is to avoid the srcu_read_lock/unlock usage in interrupt context altogether. On arm/arm64, only the weak default kvm_arch_set_irq_inatomic() exists, which returns -EWOULDBLOCK, so the condition

        if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, false) == -EWOULDBLOCK)

in irqfd_wakeup() is always true, and schedule_work() can be called directly (kvm_arch_set_irq_will_block_always() below being a new helper that would return true on arm/arm64):

@@ -195,6 +195,11 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
 	int idx;
 
 	if (flags & POLLIN) {
+		if (kvm_arch_set_irq_will_block_always()) {
+			schedule_work(&irqfd->inject);
+			goto skiplock;
+		}
+
 		idx = srcu_read_lock(&kvm->irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
@@ -208,6 +213,7 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
 		srcu_read_unlock(&kvm->irq_srcu, idx);
 	}
 
+skiplock:
 	if (flags & POLLHUP) {
 		/* The eventfd is closing, detach from KVM */
 		unsigned long flags;

This works without giving any warnings as well.

Is a patch welcome in that direction? Appreciate your feedback on this.

--
Linu cherian
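P.S. To make the suspected mechanism concrete, below is a minimal self-contained userspace sketch (illustration only, not kernel code; all names in it are invented for the demo). A plain non-atomic increment is a load/add/store sequence; when a signal handler (standing in for irqfd_wakeup() running in interrupt context) fires between the load and the store and also increments, one increment is lost, just as a __this_cpu_inc() in srcu_read_lock() can lose an increment when interrupted.

/* race_demo.c (hypothetical name): demonstrates lost updates when a
 * signal handler interrupts a non-atomic read-modify-write increment.
 * Build with: gcc -O0 -o race_demo race_demo.c
 * (volatile plus -O0 keep the increment a visible load/add/store). */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long counter;       /* analogue of lock_count */
static volatile unsigned long handler_incs;  /* increments done in the handler */

static void on_sigalrm(int sig)
{
        (void)sig;
        counter++;              /* "interrupt context" increment */
        handler_incs++;
}

int main(void)
{
        unsigned long i, loops = 200000000UL;

        signal(SIGALRM, on_sigalrm);
        ualarm(1, 1);           /* re-arm a short interval timer */

        for (i = 0; i < loops; i++)
                counter++;      /* "process context" increment */

        ualarm(0, 0);           /* stop the timer */

        /* Without lost updates, counter == loops + handler_incs. */
        printf("expected %lu, observed %lu, lost %lu\n",
               loops + handler_incs, counter,
               loops + handler_incs - counter);
        return 0;
}

Any nonzero "lost" value when this runs shows the lost-update pattern; this_cpu_inc() is the variant that is safe against exactly this kind of interruption.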