Hello Ming,

On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> Hello Martin,
>
> On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > In a recent kernel dump analysis, we found that the kernel crashed
> > because dm_rq_target_io tio->ti was pointing to invalid memory in
> > dm_end_request(), in a situation where multipathd was doing map
> > reloads because of a storage failover. The map of the respective
> > mapped_device had been replaced by a different struct dm_table.
> >
> > We observed this with a 5.3.18 distro kernel, but the code in
> > question hasn't changed much since then. Basically, we were only
> > missing b4459b11e840 ("dm rq: don't queue request to blk-mq during
> > DM suspend"), which doesn't guarantee that the race I'm thinking
> > of (see below) can't happen.
> >
> > When a map is resumed after a table reload, the live table is
> > swapped, and the tio->ti member of any live request becomes stale.
> > __dm_resume() avoids this by quiescing the queue and calling
> > dm_wait_for_completion(), which waits until
> > blk_mq_queue_inflight() doesn't report any in-flight requests.
> >
> > However, blk_mq_queue_inflight() counts only "started" requests.
> > So, if a request is dispatched before the queue was quiesced, but
> > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > request because of memory ordering effects, __dm_suspend() may
> > finish successfully,
>
> Can you explain a bit about the exact memory order which causes
> MQ_RQ_IN_FLIGHT not observed?
>
> blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> dispatch, so after blk_mq_quiesce_queue() returns, if
> blk_mq_queue_inflight() returns 0, it does mean there isn't any
> active inflight requests.
>
> If there is bug in this pattern, I guess more drivers may have such
> 'risk'.

What we know for sure is that there was a bad dm_target reference in
(struct dm_rq_target_io *tio)->ti:

crash> struct -x dm_rq_target_io c00000245ca90128
struct dm_rq_target_io {
  md = 0xc0000031c66a4000,
  ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,

crash> struct -x dm_target 0xc0080000020d0080
struct dm_target struct: invalid kernel virtual address: c0080000020d0080  type: "gdb_readmem_callback"

The question is how this could have come to pass. It can only happen
if tio->ti had been set before the map was reloaded. My theory is
that the IO had been dispatched before the queue had been quiesced,
like this:

Task A                              Task B
(dispatching IO)                    (executing a DM_SUSPEND ioctl to
                                     resume after DM_TABLE_LOAD)

                                    do_resume()
                                      dm_suspend()
                                        __dm_suspend()
dm_mq_queue_rq()
  struct dm_target *ti =
        md->immutable_target;
                                          dm_stop_queue()
                                            blk_mq_quiesce_queue()
                                    /*
                                     * At this point, the queue is
                                     * quiesced, but task A has
                                     * already entered
                                     * dm_mq_queue_rq()
                                     */
                                          dm_wait_for_completion()
                                            blk_mq_queue_inflight()
                                    /*
                                     * blk_mq_queue_inflight()
                                     * doesn't see Task A's request
                                     * because it isn't started yet
                                     */
                                          set_bit(dmf_suspended_flag)
  dm_start_request(md, rq);
                                      dm_swap_table()
                                        __bind()
                                          md->immutable_target = ...
                                        dm_table_destroy()
                                    /*
                                     * the previous
                                     * md->immutable_target is freed
                                     */
  init_tio(tio, rq, md);
  /*
   * the stale ti pointer is
   * assigned to tio->ti
   */
  tio->ti = ti;

dm_mq_queue_rq() contains no synchronization code if
md->immutable_target is set, so I think that this can happen, even
though it looks unlikely.
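To make the window concrete, here is an abridged sketch of the
dm_mq_queue_rq() fast path as I read it. This is paraphrased for
illustration, not the literal source; error handling and the
dm_get_live_table() fallback for mapped devices without an immutable
target are omitted:

static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
        struct mapped_device *md = tio->md;
        /* read with no lock and no SRCU read-side protection */
        struct dm_target *ti = md->immutable_target;

        /*
         * b4459b11e840 adds a test_bit(DMF_BLOCK_IO_FOR_SUSPEND)
         * check around here, returning BLK_STS_RESOURCE during
         * suspend.  A request that has already passed the check
         * when Task B sets the bit proceeds as before.
         */

        /*
         * Window: Task B can quiesce the queue here, see zero
         * started requests in blk_mq_queue_inflight() (ours is not
         * started yet), swap the table, and free the dm_target
         * that ti points to.
         */

        dm_start_request(md, rq);  /* request becomes MQ_RQ_IN_FLIGHT */
        init_tio(tio, rq, md);
        tio->ti = ti;              /* may store a pointer to freed memory */

        /* ... clone setup and dispatch to the underlying path ... */
        return BLK_STS_OK;
}

Nothing in this path orders the read of md->immutable_target against
the quiesce and in-flight check on the suspend side, which is the gap
the diagram above exploits.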
With b4459b11e840 (which was not applied in the customer kernel),
there would be a set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before
dm_stop_queue(), and a corresponding test_bit() check in
dm_mq_queue_rq() (see the sketch above), but IMO the above would
still be possible: a request that has already passed that check when
the bit gets set races exactly as shown.

If this can't happen, I have no more ideas how the observed situation
came to pass. The customer who sent us the core dump claims that he
has seen this multiple times already (but we have only this single
core dump).

> BTW, what is the underlying disks in your dm-mpath setting?

It was ibmvfc.

Thanks for looking into this,
Martin