Hello Ming,

On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> Hello Martin,
>
> On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > In a recent kernel dump analysis, we found that the kernel crashed
> > because dm_rq_target_io tio->ti was pointing to invalid memory in
> > dm_end_request(), in a situation where multipathd was doing map
> > reloads because of a storage failover. The map of the respective
> > mapped_device had been replaced by a different struct dm_table.
> >
> > We observed this with a 5.3.18 distro kernel, but the code in
> > question hasn't changed much since then. Basically, we were only
> > missing b4459b11e840 ("dm rq: don't queue request to blk-mq during
> > DM suspend"), which doesn't guarantee that the race I'm thinking
> > of (see below) can't happen.
> >
> > When a map is resumed after a table reload, the live table is
> > swapped, and the tio->ti member of any live request becomes stale.
> > __dm_resume() avoids this by quiescing the queue and calling
> > dm_wait_for_completion(), which waits until
> > blk_mq_queue_inflight() doesn't report any in-flight requests.
> >
> > However, blk_mq_queue_inflight() counts only "started" requests.
> > So, if a request is dispatched before the queue was quiesced, but
> > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > request because of memory ordering effects, __dm_suspend() may
> > finish successfully,
>
> Can you explain a bit about the exact memory order which causes
> MQ_RQ_IN_FLIGHT not observed?
>
> blk-mq quiesce includes synchronize_rcu() which drains all in-flight
> dispatch, so after blk_mq_quiesce_queue() returns, if
> blk_mq_queue_inflight() returns 0, it does mean there isn't any
> active inflight requests.
>
> If there is bug in this pattern, I guess more drivers may have such
> 'risk'.

What we know for sure is that there was a bad dm_target reference in
(struct dm_rq_target_io *tio)->ti:

crash> struct -x dm_rq_target_io c00000245ca90128
struct dm_rq_target_io {
  md = 0xc0000031c66a4000,
  ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,

crash> struct -x dm_target 0xc0080000020d0080
struct dm_target struct: invalid kernel virtual address: c0080000020d0080  type: "gdb_readmem_callback"

The question is how this could have come to pass. It can only happen
if tio->ti had been set before the map was reloaded. My theory is
that the IO had been dispatched before the queue had been quiesced,
like this:

Task A                              Task B
(dispatching IO)                    (executing a DM_SUSPEND ioctl to
                                     resume after DM_TABLE_LOAD)

                                    do_resume()
                                      dm_suspend()
                                        __dm_suspend()
dm_mq_queue_rq()
  struct dm_target *ti =
        md->immutable_target;
                                          dm_stop_queue()
                                            blk_mq_quiesce_queue()
                                    /*
                                     * At this point, the queue is
                                     * quiesced, but task A has
                                     * already entered
                                     * dm_mq_queue_rq()
                                     */
                                          dm_wait_for_completion()
                                            blk_mq_queue_inflight()
                                    /*
                                     * blk_mq_queue_inflight()
                                     * doesn't see Task A's request
                                     * because it isn't started yet
                                     */
                                          set_bit(dmf_suspended_flag)
  dm_start_request(md, rq);
                                      dm_swap_table()
                                        __bind()
                                          md->immutable_target = ...
                                        dm_table_destroy()
                                    /*
                                     * the previous
                                     * md->immutable_target is freed
                                     */
  init_tio(tio, rq, md);
  /*
   * the stale ti pointer is
   * assigned to tio->ti
   */
  tio->ti = ti;

dm_mq_queue_rq() contains no synchronization code if
md->immutable_target is set, so I think that this can happen, even
though it looks unlikely.
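To make the window concrete, here is an abridged sketch of the
dm_mq_queue_rq() fast path as I read it. This is paraphrased for
illustration, not the literal source; error handling and the
dm_get_live_table() fallback for mapped devices without an immutable
target are omitted:

static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
        struct mapped_device *md = tio->md;
        /* read with no lock and no SRCU read-side protection */
        struct dm_target *ti = md->immutable_target;

        /*
         * b4459b11e840 adds a test_bit(DMF_BLOCK_IO_FOR_SUSPEND)
         * check around here, returning BLK_STS_RESOURCE during
         * suspend.  A request that has already passed the check
         * when Task B sets the bit proceeds as before.
         */

        /*
         * Window: Task B can quiesce the queue here, see zero
         * started requests in blk_mq_queue_inflight() (ours is not
         * started yet), swap the table, and free the dm_target
         * that ti points to.
         */

        dm_start_request(md, rq);  /* request becomes MQ_RQ_IN_FLIGHT */
        init_tio(tio, rq, md);
        tio->ti = ti;              /* may store a pointer to freed memory */

        /* ... clone setup and dispatch to the underlying path ... */
        return BLK_STS_OK;
}

Nothing in this path orders the read of md->immutable_target against
the quiesce and in-flight check on the suspend side, which is the gap
the diagram above exploits.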
With b4459b11e840 (which was not applied in the customer kernel),
there would be a set_bit(DMF_BLOCK_IO_FOR_SUSPEND) statement before
dm_stop_queue(), and a corresponding test_bit() check in
dm_mq_queue_rq() (see the sketch above), but IMO the above would
still be possible: a request that has already passed that check when
the bit gets set races exactly as shown.

If this can't happen, I have no more ideas how the observed situation
came to pass. The customer who sent us the core dump claims that he
has seen this multiple times already (but we have only this single
core dump).

> BTW, what is the underlying disks in your dm-mpath setting?

It was ibmvfc.

Thanks for looking into this,
Martin