Hi,

Issue observed:
---------------
Running iperf traffic on a guest 10G interface (VFIO assigned device), followed by a KVM guest poweroff/shutdown, results in the following warning on the Cavium arm64 platform.

Stack trace:

[ 1151.424692] ------------[ cut here ]------------
[ 1151.429318] WARNING: CPU: 30 PID: 6744 at kernel/rcu/srcu.c:251 cleanup_srcu_struct+0xb4/0xd0
[ 1151.437832] Modules linked in: fuse cavium_rng_vf rng_core cavium_rng ipv6 btrfs xor zlib_deflate raid6_pq
[ 1151.448978] CPU: 30 PID: 6744 Comm: qemu-system-aar Tainted: G W 4.11.2 #1
[ 1151.456970] Hardware name: Gigabyte gbt-mt60/gbt-mt60, BIOS 0.3 Mar 31 2017
[ 1151.463921] task: fffffe1f994a3400 task.stack: fffffe1f99590000
[ 1151.469831] PC is at cleanup_srcu_struct+0xb4/0xd0
[ 1151.474612] LR is at cleanup_srcu_struct+0x80/0xd0
[ 1151.479392] pc : [<fffffc000811698c>] lr : [<fffffc0008116958>] pstate: 60000145
[ 1151.486776] sp : fffffe1f99593b20
[ 1151.490081] x29: fffffe1f99593b20 x28: fffffe1f994a3400
[ 1151.495385] x27: 0000000000000008 x26: fffffe1f99593de8
[ 1151.500689] x25: dead000000000100 x24: dead000000000200
[ 1151.505993] x23: fffffc0008e93f24 x22: fffffffffffffff9
[ 1151.511296] x21: ffffff000d93f0f0 x20: fffffc0008e93bc8
[ 1151.516601] x19: fffffc0008e93e18 x18: 000000000000003f
[ 1151.521904] x17: 000000000000003f x16: 000000000000000d
[ 1151.527208] x15: 0000000000000010 x14: 0000000000000000
[ 1151.532512] x13: 0000000000000040 x12: 0000000000000020
[ 1151.537816] x11: 0000000000000020 x10: 0101010101010101
[ 1151.543120] x9 : 0000000040000000 x8 : 0000000000210d00
[ 1151.548424] x7 : fefefeff71647274 x6 : 0000000000000000
[ 1151.553728] x5 : 0000000000000000 x4 : 0000000000000000
[ 1151.559032] x3 : 0000000000000000 x2 : 0000000000000040
[ 1151.564335] x1 : 0000000000000040 x0 : 0000000000000040
[ 1151.571119] ---[ end trace ff1987e1b5556fbe ]---
[ 1151.575726] Call trace:
[ 1151.578163] Exception stack(0xfffffe1f99593950 to 0xfffffe1f99593a80)
[ 1151.584594] 3940: fffffc0008e93e18 0000040000000000
[ 1151.592414] 3960: fffffe1f99593b20 fffffc000811698c fffffe1f99593980 fffffc000839cc28
[ 1151.600233] 3980: fffffe1f995939d0 fffffc00081344cc fffffc000896b000 fffffe1f99593ad8
[ 1151.608053] 39a0: fffffe1f99593ab0 fffffc00081d2a5c fffffdffc7d95ac0 ffffff1f656b1a00
[ 1151.615873] 39c0: fffffe1f99593ad0 fffffc00081d2a5c fffffdff87e75e00 fffffe1f9d780000
[ 1151.623692] 39e0: fffffe1f994a3400 000000000000051e 0000000000000040 0000000000000040
[ 1151.631511] 3a00: 0000000000000040 0000000000000000 0000000000000000 0000000000000000
[ 1151.639330] 3a20: 0000000000000000 fefefeff71647274 0000000000210d00 0000000040000000
[ 1151.647150] 3a40: 0101010101010101 0000000000000020 0000000000000020 0000000000000040
[ 1151.654970] 3a60: 0000000000000000 0000000000000010 000000000000000d 000000000000003f
[ 1151.662789] [<fffffc000811698c>] cleanup_srcu_struct+0xb4/0xd0
[ 1151.668614] [<fffffc000809c930>] kvm_put_kvm+0x1e0/0x238
[ 1151.673916] [<fffffc000809c9f8>] kvm_vm_release+0x20/0x30
[ 1151.679308] [<fffffc00081f0c64>] __fput+0x8c/0x1d0
[ 1151.684089] [<fffffc00081f0e0c>] ____fput+0xc/0x18
[ 1151.688872] [<fffffc00080d5830>] task_work_run+0xc0/0xe0
[ 1151.694176] [<fffffc00080bf194>] do_exit+0x2c4/0x978
[ 1151.699131] [<fffffc00080bf8ac>] do_group_exit+0x34/0x98
[ 1151.704435] [<fffffc00080c9118>] get_signal+0x1e0/0x4c0
[ 1151.709653] [<fffffc00080863d0>] do_signal+0xb8/0x4d8
[ 1151.714694] [<fffffc0008086a08>] do_notify_resume+0x88/0xa8
[ 1151.720256] [<fffffc0008082ad0>] work_pending+0x8/0x10
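For reference, the warning comes from cleanup_srcu_struct() finding srcu_readers_active() still true when the VM is torn down. A condensed sketch of that check, based on my reading of the 4.11 classic SRCU code in kernel/rcu/srcu.c (abbreviated, so treat the details as approximate):

/* Condensed from 4.11 kernel/rcu/srcu.c (approximate): readers are
 * considered active while the summed per-CPU lock counts do not
 * balance the summed unlock counts across both array indexes. */
static bool srcu_readers_active(struct srcu_struct *sp)
{
        int cpu;
        unsigned long sum = 0;

        for_each_possible_cpu(cpu) {
                struct srcu_array *cpuc = per_cpu_ptr(sp->per_cpu_ref, cpu);

                sum += READ_ONCE(cpuc->lock_count[0]);
                sum += READ_ONCE(cpuc->lock_count[1]);
                sum -= READ_ONCE(cpuc->unlock_count[0]);
                sum -= READ_ONCE(cpuc->unlock_count[1]);
        }
        return sum;     /* nonzero here triggers the WARN in cleanup_srcu_struct() */
}

A single lost lock-side increment means this sum can never return to zero, so the WARN fires when the last reference to the VM is dropped.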
Analysis
--------
Additional prints in srcu_readers_active() indicate that the per-CPU variable lock_count (of struct srcu_struct) lags behind the unlock count, hence the warning.

In the KVM irq injection path, irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt context. A worker thread, irqfd_inject(), also calls srcu_read_lock() (inside kvm_set_irq()) in process context. This can race when incrementing lock_count, since __this_cpu_inc() is a plain, non-atomic read-modify-write: if the interrupt arrives on the same CPU between the load and the store of the worker's increment, one of the two increments is lost. Replacing __this_cpu_inc() with this_cpu_inc() (the interrupt-safe version) works around the problem, which backs up this analysis.

Possible solution
-----------------
One way is to avoid the srcu_read_lock/unlock usage in interrupt context altogether. On arm/arm64, only the weak default kvm_arch_set_irq_inatomic() exists, which returns -EWOULDBLOCK, so the condition

        if (kvm_arch_set_irq_inatomic(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1, false) == -EWOULDBLOCK)

in irqfd_wakeup() is always true, and schedule_work() can be called directly (kvm_arch_set_irq_will_block_always() below being a new helper that would return true on arm/arm64):

@@ -195,6 +195,11 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
 	int idx;
 
 	if (flags & POLLIN) {
+		if (kvm_arch_set_irq_will_block_always()) {
+			schedule_work(&irqfd->inject);
+			goto skiplock;
+		}
+
 		idx = srcu_read_lock(&kvm->irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
@@ -208,6 +213,7 @@ int __attribute__((weak)) kvm_arch_set_irq_inatomic(
 		srcu_read_unlock(&kvm->irq_srcu, idx);
 	}
 
+skiplock:
 	if (flags & POLLHUP) {
 		/* The eventfd is closing, detach from KVM */
 		unsigned long flags;

This works without giving any warnings as well.

Is a patch welcome in that direction? Appreciate your feedback on this.

--
Linu cherian
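P.S. To make the suspected mechanism concrete, below is a minimal self-contained userspace sketch (illustration only, not kernel code; all names in it are invented for the demo). A plain non-atomic increment is a load/add/store sequence; when a signal handler (standing in for irqfd_wakeup() running in interrupt context) fires between the load and the store and also increments, one increment is lost, just as a __this_cpu_inc() in srcu_read_lock() can lose an increment when interrupted.

/* race_demo.c (hypothetical name): demonstrates lost updates when a
 * signal handler interrupts a non-atomic read-modify-write increment.
 * Build with: gcc -O0 -o race_demo race_demo.c
 * (volatile plus -O0 keep the increment a visible load/add/store). */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile unsigned long counter;       /* analogue of lock_count */
static volatile unsigned long handler_incs;  /* increments done in the handler */

static void on_sigalrm(int sig)
{
        (void)sig;
        counter++;              /* "interrupt context" increment */
        handler_incs++;
}

int main(void)
{
        unsigned long i, loops = 200000000UL;

        signal(SIGALRM, on_sigalrm);
        ualarm(1, 1);           /* re-arm a short interval timer */

        for (i = 0; i < loops; i++)
                counter++;      /* "process context" increment */

        ualarm(0, 0);           /* stop the timer */

        /* Without lost updates, counter == loops + handler_incs. */
        printf("expected %lu, observed %lu, lost %lu\n",
               loops + handler_incs, counter,
               loops + handler_incs - counter);
        return 0;
}

Any nonzero "lost" value when this runs shows the lost-update pattern; this_cpu_inc() is the variant that is safe against exactly this kind of interruption.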