use after free with blk_mq_update_nr_requests racing with do_resume/dm_swap_table

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi All,

I've been chasing down a kernel panic [2] on a RHEL7 kernel that stems from blk_done_softirq calling dm_done attempting to reference tio->ti, however tio->ti has been freed which leads to the panic.

It took me a while to figure out what was going on, but it appears as though blk_mq_update_nr_requests is racing with do_resume/dm_swap_table. blk_mq_update_nr_requests will un-quiesce the queue when it's complete, however this un-quiesce can sometimes occur while do_resume->dm_swap_table has quiesced the queues but before md->immutable_table has been updated with the new table reference. This leads to an I/O sneaking in, picking up tio->ti from md->immutable_table which is subsequently freed after the call to dm_swap_table and then when the softirq handler fires tio->ti has been freed and we end up with a panic.

I added some timers and debug printk's into the code hopefully you can see the race more clearly below [1].

I'm sure there are a myriad of approaches one could take to fix this, but to me it seems the patch here:

http://lists.infradead.org/pipermail/linux-nvme/2017-December/014447.html

Which allows concurrent calls to blk_mq_quiesce_queue would address this behavior.

While I haven't tested this on a mainline kernel, it seems to me that this race still exists there. I'll also raise this through RedHat, but since it seems to me to be an upstream issue as well I thought a post here would be appropriate.

I'm curious, what's the state of that patch? I do share Christoph's concern about the nvme driver being able to forcibly un-quiesce the queue and don't have any suggestions to offer, but it does seem to me that a quiesce operation isn't particularly useful if it can be unquiesced by another caller without knowledge of the original quiescer. It almost seems like there are two scenarios for the quiesce/unquiesce operations today--

1. a momentary pause where the quiesce/resume occur within the same function to change queue parameters (this seems to be most situations)

2. a longer pause to keep requests from being dispatched while device recovery occurs (this seems to me to be the nvme case)

To me it seems the 2nd case is incorrect, and one ought to distinguish between quiesced and stopped (e.g. you quiesce the queue to then stop it rather than using the quiesce as the stop mechanism).

-Aaron

[1]-

nr_requests [15679.617750] blk_mq_update_nr_requests: q=ffff99491720e988 enter

do_resume   [15679.639846] do_resume[1015]: md=ffff99491e7ed000 new_map=ffff99491dbc8800 do_resume   [15679.639848]      dm_suspend[1015]: md=ffff99491e7ed000 suspend_flags=2 enter do_resume   [15679.639850]        __dm_suspend[1015]: enter md=ffff99491e7ed000 md->flags=64 map=ffff99491dbc9600 suspend_flags=2 task_state=1 dmf_suspended_flags=1

nr_requests [15679.639860]   blk_mq_quiesce_queue: q=ffff99491720e988       \__ one is from dm_suspend, the other from update_nr_requests do_resume   [15679.641717]         blk_mq_quiesce_queue: q=ffff99491720e988 /   not sure which is which, but doesn't matter much

do_resume   [15679.666672]       dm_mq_stop_queue: q=ffff99491720e988 stopped do_resume   [15679.666684]     __dm_suspend[1015]: return end md=ffff99491e7ed000 map=ffff99491dbc9600 rc=0 do_resume   [15679.666688]   dm_suspend[1015]: md=ffff99491e7ed000 suspend_flags=2 exit r=0 do_resume   [15679.666690]   dm_swap_table: enter md=ffff99491e7ed000 md->flags=67 md->immutable_table=ffffb15b06d8d040 table=ffff99491dbc8800
do_resume                       dm_mq_stop_queue enter

nr_requests [15679.666695]       blk_mq_quiesce_queue: q=ffff99491720e988
nr_requests [15679.688675] blk_mq_unquiesce_queue: q=ffff99491720e988

do_resume    15679.694996073 = dm_start_time
do_resume    15679.695013154 = dm_mq_queue_rq_ns
do_resume    15679.695013194 = init_tio_ns (set tio->ti = ti; ti is md->immutable_table (ffffb15b06d8d040))

do_resume   [15679.695678]     dm_mq_stop_queue: q=ffff99491720e988 stopped
do_resume   [15679.695756]   dm_swap_table: md=ffff99491e7ed000 md->flags=67 (orig 67) md->immutable_table=ffffb15b06d93040 (orig ffffb15b06d8d040) old=ffff99491dbc8800 new=ffff99491dbc9600 table->type=3 (orig 3) suspended=1
do_resume   [15679.695758]   dm_resume[1015]: md=ffff99491e7ed000 enter
do_resume   [15679.695761] dm_mq_start_queue: q=ffff99491720e988 resuming
do_resume   [15679.695763] blk_mq_unquiesce_queue: q=ffff99491720e988
do_resume   [15679.708229]   dm_resume[1015]: md=ffff99491e7ed000 normal rc=0
do_resume                      tio->ti destroyed (ffffb15b06d8d040)
do_resume   [15679.708247] do_resume[1015]: md=ffff99491e7ed000 new_map=ffff99491dbc8800 rc=0

softirq handler fires==>
PANIC [15679.708629] BUG: unable to handle kernel paging request at ffffb15b06d8d040


[2]-

[15679.708629] BUG: unable to handle kernel paging request at ffffb15b06d8d040
[15679.723104] IP: [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.731099] PGD 17fd65067 PUD 8bfc02067 PMD 81d650067 PTE 0
[15679.737841] Oops: 0000 [#1] SMP
[15679.741954] Modules linked in: stap_dcf9a1b1b114d97436231b91be6919e_24275(OE) dm_round_robin ib_iser(OE) libiscsi scsi_transport_iscsi rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache lnet(OE) dell_rbu libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) sunrpc iTCO_wdt iTCO_vendor_support sb_edac coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich sg mei_me mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter pcc_cpufreq dm_multipath binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sr_mod [15679.839498]  cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci mlx5_core(OE) libata crct10dif_pclmul tg3 crct10dif_common mlxfw(OE) vfio_mdev(OE) vfio_iommu_type1 crc32c_intel vfio megaraid_sas mdev(OE) devlink mlx_compat(OE) drm_panel_orientation_quirks ptp pps_core dm_mirror dm_region_hash dm_log dm_mod(OE) [last unloaded: stap_18d5e581b9a87f90fa46e5292bea9ac_14254] [15679.883540] CPU: 3 PID: 26 Comm: ksoftirqd/3 Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[15679.898105] Hardware name:    /0H47HH, BIOS 2.7.0 05/23/2018
[15679.905263] task: ffff9941b4e9c1c0 ti: ffff9941b4484000 task.ti: ffff9941b4484000 [15679.914473] RIP: 0010:[<ffffffffc0195d9d>] [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15679.925561] RSP: 0018:ffff9941b4487da0  EFLAGS: 00010202
[15679.932364] RAX: 0000000000000001 RBX: 0000000000000000 RCX: dead000000000200 [15679.941212] RDX: ffff9941b4487dd8 RSI: ffff9941b4487dc8 RDI: ffffb15b06d8d040 [15679.950064] RBP: ffff9941b4487dc8 R08: ffff9948f1280980 R09: ffff99491f053900 [15679.958914] R10: 000000000000b731 R11: 000000000000000d R12: ffff9948df314d00 [15679.967767] R13: ffff9948f1280a80 R14: 0000000000000000 R15: 0000000000000009 [15679.976631] FS:  0000000000000000(0000) GS:ffff99491f040000(0000) knlGS:0000000000000000
[15679.986581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15679.993913] CR2: ffffb15b06d8d040 CR3: 000000101d272000 CR4: 00000000001607e0
[15680.002810] Call Trace:
[15680.006473]  [<ffffffffaed58d86>] blk_done_softirq+0x96/0xc0
[15680.013741]  [<ffffffffaeaa5305>] __do_softirq+0xf5/0x280
[15680.020714]  [<ffffffffaeaa54c8>] run_ksoftirqd+0x38/0x50
[15680.027693]  [<ffffffffaeacec24>] smpboot_thread_fn+0x144/0x1a0
[15680.035251]  [<ffffffffaeaceae0>] ? lg_double_unlock+0x40/0x40
[15680.042710]  [<ffffffffaeac61f1>] kthread+0xd1/0xe0
[15680.049100]  [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.056847]  [<ffffffffaf18dd37>] ret_from_fork_nospec_begin+0x21/0x21
[15680.065091]  [<ffffffffaeac6120>] ? insert_kthread_work+0x40/0x40
[15680.072848] Code: 01 00 00 4d 85 e4 0f 84 92 00 00 00 48 8b 43 48 41 8b 5d 58 4d 8b ac 24 70 01 00 00 48 c1 e8 16 49 8b 7d 08 48 83 f0 01 83 e0 01 <48> 83 3f 00 74 0c 48 8b 57 08 84 c0 4c 8b 42 60 75 41 83 fb 87
[15680.096566] RIP  [<ffffffffc0195d9d>] dm_softirq_done+0x5d/0x310 [dm_mod]
[15680.105181]  RSP <ffff9941b4487da0>
[15680.110091] CR2: ffffb15b06d8d040


--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel




[Index of Archives]     [DM Crypt]     [Fedora Desktop]     [ATA RAID]     [Fedora Marketing]     [Fedora Packaging]     [Fedora SELinux]     [Yosemite Discussion]     [KDE Users]     [Fedora Docs]

  Powered by Linux