Re: [BUG] Deadlock due to interactions of block, RCU, and cpu offline

Jeffrey Hugo <jhugo@xxxxxxxxxxxxxx> · Tue, 22 Aug 2017 14:53:55 -0600

On 8/22/2017 10:12 AM, Paolo Bonzini wrote:
> On 20/08/2017 22:56, Paul E. McKenney wrote:
>>>       KVM: async_pf: avoid async pf injection when in guest mode
>>>       KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation
>>>       arm: KVM: Allow unaligned accesses at HYP
>>>       arm64: KVM: Allow unaligned accesses at EL2
>>>       arm64: KVM: Preserve RES1 bits in SCTLR_EL2
>>>       KVM: arm/arm64: Handle possible NULL stage2 pud when ageing pages
>>>       KVM: nVMX: Fix exception injection
>>>       kvm: async_pf: fix rcu_irq_enter() with irqs enabled
>>>       KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
>>>       KVM: s390: fix ais handling vs cpu model
>>>       KVM: arm/arm64: Fix isues with GICv2 on GICv3 migration
>>>
>>> Nothing really stands out to me which would "fix" the issue.
>>
>> My guess would be an undo of the change that provoked the problem
>> in the first place.  Did you try bisecting within the above group
>> of commits?
>>
>> Either way, CCing Paolo for his thoughts?
>
> There is "kvm: async_pf: fix rcu_irq_enter() with irqs enabled", but it
> would have caused splats, not deadlocks.
>
> If you are using nested virtualization, "KVM: async_pf: avoid async pf
> injection when in guest mode" can be a wildcard, but only if you have
> memory pressure.
>
> My bet is still on the former changing the timing just a little bit.
>
> Paolo

I'm sorry, I must have done the bisect incorrectly.

I attempted to bisect the KVM changes from the merge, but the issue did
not reproduce with any of them.  I double-checked the merge commit, and
found that it did not introduce a "fix".

I redid the bisect, and this time it identified the following change.  I
double-checked that reverting the change reintroduces the deadlock, and
that cherry-picking the change onto 4.12-rc4 (known to exhibit the issue)
causes the issue to disappear.  I'm pretty sure (knock on wood) that the
bisect result is actually correct this time.

commit 6460495709aeb651896bc8e5c134b2e4ca7d34a8
Author: James Wang <jnwang@xxxxxxxx>
Date:   Thu Jun 8 14:52:51 2017 +0800

    Fix loop device flush before configure v3

    While installing SLES-12 (based on v4.4), I found that the installer
    will stall for 60+ seconds during LVM disk scan.  The root cause was
    determined to be the removal of a bound device check in loop_flush()
    by commit b5dd2f6047ca ("block: loop: improve performance via blk-mq").

    Restoring this check, examining ->lo_state as set by loop_set_fd()
    eliminates the bad behavior.

    Test method:
    modprobe loop max_loop=64
    dd if=/dev/zero of=disk bs=512 count=200K
    for((i=0;i<4;i++))do losetup -f disk; done
    mkfs.ext4 -F /dev/loop0
    for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done
    for f in `ls /dev/loop[0-9]*|sort`; do \
        echo $f; dd if=$f of=/dev/null  bs=512 count=1; \
        done

    Test output:  stock          patched
    /dev/loop0    18.1217e-05    8.3842e-05
    /dev/loop1     6.1114e-05    0.000147979
    /dev/loop10    0.414701      0.000116564
    /dev/loop11    0.7474        6.7942e-05
    /dev/loop12    0.747986      8.9082e-05
    /dev/loop13    0.746532      7.4799e-05
    /dev/loop14    0.480041      9.3926e-05
    /dev/loop15    1.26453       7.2522e-05

    Note that from loop10 onward, the device is not mounted, yet the
    stock kernel consumes several orders of magnitude more wall time
    than it does for a mounted device.
    (Thanks for Mike Galbraith <efault@xxxxxx>, give a changelog review.)

    Reviewed-by: Hannes Reinecke <hare@xxxxxxxx>
    Reviewed-by: Ming Lei <ming.lei@xxxxxxxxxx>
    Signed-off-by: James Wang <jnwang@xxxxxxxx>
    Fixes: b5dd2f6047ca ("block: loop: improve performance via blk-mq")
    Signed-off-by: Jens Axboe <axboe@xxxxxx>

Considering the original analysis of the issue, it seems plausible that 
this change could be fixing it.
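
For anyone skimming, the check that the commit message describes can be
modelled in userspace roughly as below.  This is only a sketch under
assumptions, not the driver code: struct loop_model, model_switch(),
model_flush() and the LO_* names are hypothetical stand-ins, and the
switch_calls counter exists purely so the effect can be observed.  The
one thing taken from the changelog is the idea that a flush should look
at ->lo_state and do nothing for a device that was never bound.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the loop device states. */
enum lo_state { LO_UNBOUND, LO_BOUND, LO_RUNDOWN };

struct loop_model {
    enum lo_state state;  /* LO_BOUND once a backing file is attached */
    int switch_calls;     /* how often the expensive path was taken */
};

/* Hypothetical stand-in for the expensive step a flush would trigger. */
static int model_switch(struct loop_model *lo)
{
    lo->switch_calls++;
    return 0;
}

/*
 * Flush only a bound device.  An unbound device has no backing file
 * and nothing queued, so return early instead of taking the slow path.
 */
static int model_flush(struct loop_model *lo)
{
    assert(lo != NULL);
    if (lo->state != LO_BOUND)
        return 0;
    return model_switch(lo);
}
```

In this model, calling model_flush() on a device in LO_UNBOUND returns 0
without ever reaching model_switch(), while a device in LO_BOUND still
goes through it.  That mirrors the changelog's observation that the
unmounted loop10 and later devices were the slow ones on the stock
kernel.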

--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.