Hi All,
I've been chasing down a kernel panic [2] on a RHEL7 kernel that stems
from blk_done_softirq calling dm_done attempting to reference tio->ti,
however tio->ti has been freed which leads to the panic.
It took me a while to figure out what was going on, but it appears as
though blk_mq_update_nr_requests is racing with do_resume/dm_swap_table.
blk_mq_update_nr_requests will un-quiesce the queue when it's complete,
however this un-quiesce can sometimes occur while
do_resume->dm_swap_table has quiesced the queues but before
md->immutable_table has been updated with the new table reference. This
leads to an I/O sneaking in, picking up tio->ti from md->immutable_table
which is subsequently freed after the call to dm_swap_table and then
when the softirq handler fires tio->ti has been freed and we end up with
a panic.
I added some timers and debug printk's into the code hopefully you can
see the race more clearly below [1].
I'm sure there are a myriad of approaches one could take to fix this,
but to me it seems the patch here:
http://lists.infradead.org/pipermail/linux-nvme/2017-December/014447.html
Which allows concurrent calls to blk_mq_quiesce_queue would address this
behavior.
While I haven't tested this on a mainline kernel, it seems to me that
this race still exists there. I'll also raise this through RedHat, but
since it seems to me to be an upstream issue as well I thought a post
here would be appropriate.
I'm curious, what's the state of that patch? I do share Christoph's
concern about the nvme driver being able to forcibly un-quiesce the
queue and don't have any suggestions to offer, but it does seem to me
that a quiesce operation isn't particularly useful if it can be
unquiesced by another caller without knowledge of the original quiescer.
It almost seems like there are two scenarios for the quiesce/unquiesce
operations today--
1. a momentary pause where the quiesce/resume occur within the same
function to change queue parameters (this seems to be most situations)
2. a longer pause to keep requests from being dispatched while device
recovery occurs (this seems to me to be the nvme case)
To me it seems the 2nd case is incorrect, and one ought to distinguish
between quiesced and stopped (e.g. you quiesce the queue to then stop it
rather than using the quiesce as the stop mechanism).
-Aaron
[1]-
nr_requests [15679.617750] blk_mq_update_nr_requests: q=ffff99491720e988
enter
do_resume [15679.639846] do_resume[1015]: md=ffff99491e7ed000
new_map=ffff99491dbc8800
do_resume [15679.639848] dm_suspend[1015]: md=ffff99491e7ed000
suspend_flags=2 enter
do_resume [15679.639850] __dm_suspend[1015]: enter
md=ffff99491e7ed000 md->flags=64 map=ffff99491dbc9600 suspend_flags=2
task_state=1 dmf_suspended_flags=1
nr_requests [15679.639860] blk_mq_quiesce_queue:
q=ffff99491720e988 \__ one is from dm_suspend, the other from
update_nr_requests
do_resume [15679.641717] blk_mq_quiesce_queue:
q=ffff99491720e988 / not sure which is which, but doesn't matter much
do_resume [15679.666672] dm_mq_stop_queue: q=ffff99491720e988
stopped
do_resume [15679.666684] __dm_suspend[1015]: return end
md=ffff99491e7ed000 map=ffff99491dbc9600 rc=0
do_resume [15679.666688] dm_suspend[1015]: md=ffff99491e7ed000
suspend_flags=2 exit r=0
do_resume [15679.666690] dm_swap_table: enter md=ffff99491e7ed000
md->flags=67 md->immutable_table=ffffb15b06d8d040 table=ffff99491dbc8800
do_resume dm_mq_stop_queue enter
nr_requests [15679.666695] blk_mq_quiesce_queue: q=ffff99491720e988
nr_requests [15679.688675] blk_mq_unquiesce_queue: q=ffff99491720e988
do_resume 15679.694996073 = dm_start_time
do_resume 15679.695013154 = dm_mq_queue_rq_ns
do_resume 15679.695013194 = init_tio_ns (set tio->ti = ti; ti is
md->immutable_table (ffffb15b06d8d040))
do_resume [15679.695678] dm_mq_stop_queue: q=ffff99491720e988 stopped
do_resume [15679.695756] dm_swap_table: md=ffff99491e7ed000
md->flags=67 (orig 67) md->immutable_table=ffffb15b06d93040 (orig
ffffb15b06d8d040) old=ffff99491dbc8800 new=ffff99491dbc9600
table->type=3 (orig 3) suspended=1
do_resume [15679.695758] dm_resume[1015]: md=ffff99491e7ed000 enter
do_resume [15679.695761] dm_mq_start_queue: q=ffff99491720e988 resuming
do_resume [15679.695763] blk_mq_unquiesce_queue: q=ffff99491720e988
do_resume [15679.708229] dm_resume[1015]: md=ffff99491e7ed000 normal
rc=0
do_resume tio->ti destroyed (ffffb15b06d8d040)
do_resume [15679.708247] do_resume[1015]: md=ffff99491e7ed000
new_map=ffff99491dbc8800 rc=0
softirq handler fires==>
PANIC [15679.708629] BUG: unable to handle kernel paging request at
ffffb15b06d8d040
[2]-
[15679.708629] BUG: unable to handle kernel paging request at
ffffb15b06d8d040
[15679.723104] IP: [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.731099] PGD 17fd65067 PUD 8bfc02067 PMD 81d650067 PTE 0
[15679.737841] Oops: 0000 [#1] SMP
[15679.741954] Modules linked in:
stap_dcf9a1b1b114d97436231b91be6919e_24275(OE) dm_round_robin
ib_iser(OE) libiscsi scsi_transport_iscsi rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache lnet(OE) dell_rbu libcfs(OE)
rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE)
sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi
kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw
gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich sg mei_me
mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter
pcc_cpufreq dm_multipath binfmt_misc knem(OE) ip_tables ext4 mbcache
jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE)
ib_core(OE) sr_mod
[15679.839498] cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci mlx5_core(OE)
libata crct10dif_pclmul tg3 crct10dif_common mlxfw(OE) vfio_mdev(OE)
vfio_iommu_type1 crc32c_intel vfio megaraid_sas mdev(OE) devlink
mlx_compat(OE) drm_panel_orientation_quirks ptp pps_core dm_mirror
dm_region_hash dm_log dm_mod(OE) [last unloaded:
stap_18d5e581b9a87f90fa46e5292bea9ac_14254]
[15679.883540] CPU: 3 PID: 26 Comm: ksoftirqd/3 Kdump: loaded Tainted:
G OE ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[15679.898105] Hardware name: /0H47HH, BIOS 2.7.0 05/23/2018
[15679.905263] task: ffff9941b4e9c1c0 ti: ffff9941b4484000 task.ti:
ffff9941b4484000
[15679.914473] RIP: 0010:[<ffffffffc0195d9d>] [<ffffffffc0195d9d>]
dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.925561] RSP: 0018:ffff9941b4487da0 EFLAGS: 00010202
[15679.932364] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
dead000000000200
[15679.941212] RDX: ffff9941b4487dd8 RSI: ffff9941b4487dc8 RDI:
ffffb15b06d8d040
[15679.950064] RBP: ffff9941b4487dc8 R08: ffff9948f1280980 R09:
ffff99491f053900
[15679.958914] R10: 000000000000b731 R11: 000000000000000d R12:
ffff9948df314d00
[15679.967767] R13: ffff9948f1280a80 R14: 0000000000000000 R15:
0000000000000009
[15679.976631] FS: 0000000000000000(0000) GS:ffff99491f040000(0000)
knlGS:0000000000000000
[15679.986581] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15679.993913] CR2: ffffb15b06d8d040 CR3: 000000101d272000 CR4:
00000000001607e0
[15680.002810] Call Trace:
[15680.006473] [<ffffffffaed58d86>] blk_done_softirq+0x96/0xc0
[15680.013741] [<ffffffffaeaa5305>] __do_softirq+0xf5/0x280
[15680.020714] [<ffffffffaeaa54c8>] run_ksoftirqd+0x38/0x50
[15680.027693] [<ffffffffaeacec24>] smpboot_thread_fn+0x144/0x1a0
[15680.035251] [<ffffffffaeaceae0>] ? lg_double_unlock+0x40/0x40
[15680.042710] [<ffffffffaeac61f1>] kthread+0xd1/0xe0
[15680.049100] [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.056847] [<ffffffffaf18dd37>] ret_from_fork_nospec_begin+0x21/0x21
[15680.065091] [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.072848] Code: 01 00 00 4d 85 e4 0f 84 92 00 00 00 48 8b 43 48 41
8b 5d 58 4d 8b ac 24 70 01 00 00 48 c1 e8 16 49 8b 7d 08 48 83 f0 01 83
e0 01 <48> 83 3f 00 74 0c 48 8b 57 08 84 c0 4c 8b 42 60 75 41 83 fb 87
[15680.096566] RIP [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15680.105181] RSP <ffff9941b4487da0>
[15680.110091] CR2: ffffb15b06d8d040
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel