On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote: > Hello Ming, > > On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote: > > Hello Martin, > > > > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote: > > > In a recent kernel dump analysis, we found that the kernel crashed > > > because > > > dm_rq_target_io tio->ti was pointing to invalid memory in > > > dm_end_request(), > > > in a situation where multipathd was doing map reloads because of a > > > storage > > > failover. The map of the respective mapped_device had been replaced > > > by a > > > different struct dm_table. > > > > > > We obverved this with a 5.3.18 distro kernel, but the code in > > > question > > > hasn't change much since then. Basically, we were only missing > > > b4459b11e840 ("dm rq: don't queue request to blk-mq during DM > > > suspend"), > > > which doesn't guarantee that the race I'm thinking of (see below) > > > can't > > > happen. > > > > > > When a map is resumed after a table reload, the live table is > > > swapped, and > > > the tio->ti member of any live request becomes stale. > > > __dm_resume() avoids > > > this by quiescing the queue and calling dm_wait_for_completion(), > > > which > > > waits until blk_mq_queue_inflight() doesn't report any in-flight > > > requests. > > > > > > However, blk_mq_queue_inflight() counts only "started" requests. > > > So, if a > > > request is dispatched before the queue was quiesced, but > > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this > > > request > > > because of memory ordering effects, __dm_suspend() may finish > > > successfully, > > > > Can you explain a bit about the exact memory order which causes > > MQ_RQ_IN_FLIGHT > > not observed? > > > > blk-mq quiesce includes synchronize_rcu() which drains all in-flight > > dispatch, so after blk_mq_quiesce_queue() returns, > > if blk_mq_queue_inflight() returns 0, it does mean there isn't any > > active inflight requests. > > > > If there is bug in this pattern, I guess more drivers may have such > > 'risk'. > > What we know for sure is that there was a bad dm_target reference in > (struct dm_rq_target_io *tio)->ti: > > crash> struct -x dm_rq_target_io c00000245ca90128 > struct dm_rq_target_io { > md = 0xc0000031c66a4000, > ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>, > > crash> struct -x dm_target 0xc0080000020d0080 > struct dm_target struct: invalid kernel virtual address: > c0080000020d0080 type: "gdb_readmem_callback" > > The question is how this could have come to pass. It can only happen > if tio->ti had been set before the map was reloaded. > My theory is that the IO had been dispatched before the queue had been > quiesced, like this: > > Task A Task B > (dispatching IO) (executing a DM_SUSPEND ioctl to > resume after DM_TABLE_LOAD) > do_resume() > dm_suspend() > __dm_suspend() > dm_mq_queue_rq() > struct dm_target *ti = > md->immutable_target; > dm_stop_queue() > blk_mq_quiesce_queue() > /* > * At this point, the queue is quiesced, but task A > * has alreadyentered dm_mq_queue_rq() > */ That shouldn't happen, blk_mq_quiesce_queue() drains all pending dm_mq_queue_rq() and prevents new dm_mq_queue_rq() from being called. > dm_wait_for_completion() > blk_mq_queue_inflight() > /* > * blk_mq_queue_inflight() doesn't see Task A's > * request because it isn't started yet > */ > set_bit(dmf_suspended_flag) > dm_start_request(md, rq); dm_swap_table() > __bind() > md->immutable_target = ... > dm_target_destroy() > /* the previous md->immutable_target is freed */ > init_tio(tio, rq, md); > /* the stale ti pointer is assigned to tio->ti */ > tio->ti = ti; > > dm_mq_queue_rq() contains no synchronization code if > md->immutable_target is set, so I think that this can happen, even > though it looks unlikely. With b4459b11e840 (which was not applied in > the customer kernel), there would be a > set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before dm_stop_queue(), > but IMO that the above would still be possible. > > If this can't happen, I have no more ideas how the observed situation > came to pass. The customer who sent us the core claims that > he has seen this multiple times already (but we have only this single > core dump). Your kernel is v5.3, which is too old, so as Mike suggested, please try b4459b11e840 ("dm rq: don't queue request to blk-mq during DM suspend") given blk-mq has too many changes here. IMO, this issue should happen in upstream kernel. Thanks, Ming