On Wed, 2024-03-20 at 11:03 +0800, Ming Lei wrote:
> On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote:
> >
> > What we know for sure is that there was a bad dm_target reference in
> > (struct dm_rq_target_io *tio)->ti:
> >
> > crash> struct -x dm_rq_target_io c00000245ca90128
> > struct dm_rq_target_io {
> >   md = 0xc0000031c66a4000,
> >   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> >
> > crash> struct -x dm_target 0xc0080000020d0080
> > struct dm_target struct: invalid kernel virtual address: c0080000020d0080
> > type: "gdb_readmem_callback"
> >
> > The question is how this could have come to pass. It can only happen
> > if tio->ti had been set before the map was reloaded.
> > My theory is that the IO had been dispatched before the queue had been
> > quiesced, like this:
> >
> > Task A                        Task B
> > (dispatching IO)              (executing a DM_SUSPEND ioctl to
> >                                resume after DM_TABLE_LOAD)
> >
> >                               do_resume()
> >                                 dm_suspend()
> >                                   __dm_suspend()
> > dm_mq_queue_rq()
> >   struct dm_target *ti =
> >       md->immutable_target;
> >                                     dm_stop_queue()
> >                                       blk_mq_quiesce_queue()
> > /*
> >  * At this point, the queue is quiesced, but task A
> >  * has already entered dm_mq_queue_rq()
> >  */
>
> That shouldn't happen, blk_mq_quiesce_queue() drains all pending
> dm_mq_queue_rq() and prevents new dm_mq_queue_rq() from being called.

Thanks for pointing this out. I'd been missing the fact that the
synchronization is achieved by the rcu_read_lock() in
__blk_mq_run_dispatch_ops(), which guards invocations of the request
dispatching code against the synchronize_rcu() in
blk_mq_wait_quiesce_done(). In our old kernel this was still done in
hctx_lock(), to the same effect.

This means I no longer see how our dm_target reference could have
pointed to freed memory. For now, we'll follow Mike's advice.

Thanks a lot,
Martin