Re: [RFC Patch] dm: make sure to wait for all dispatched requests in __dm_suspend()

On Tue, Mar 19, 2024 at 04:41:26PM +0100, Martin Wilck wrote:
> Hello Ming,
> 
> On Tue, 2024-03-19 at 21:04 +0800, Ming Lei wrote:
> > Hello Martin,
> > 
> > On Sat, Mar 16, 2024 at 12:10:35AM +0100, Martin Wilck wrote:
> > > In a recent kernel dump analysis, we found that the kernel crashed
> > > because dm_rq_target_io tio->ti was pointing to invalid memory in
> > > dm_end_request(), in a situation where multipathd was doing map
> > > reloads because of a storage failover. The map of the respective
> > > mapped_device had been replaced by a different struct dm_table.
> > > 
> > > We observed this with a 5.3.18 distro kernel, but the code in
> > > question hasn't changed much since then. Basically, we were only
> > > missing b4459b11e840 ("dm rq: don't queue request to blk-mq during
> > > DM suspend"), which doesn't guarantee that the race I'm thinking of
> > > (see below) can't happen.
> > > 
> > > When a map is resumed after a table reload, the live table is
> > > swapped, and the tio->ti member of any live request becomes stale.
> > > __dm_suspend() avoids this by quiescing the queue and calling
> > > dm_wait_for_completion(), which waits until blk_mq_queue_inflight()
> > > doesn't report any in-flight requests.
> > > 
> > > However, blk_mq_queue_inflight() counts only "started" requests.
> > > So, if a request is dispatched before the queue is quiesced, but
> > > dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
> > > request because of memory ordering effects, __dm_suspend() may
> > > finish successfully,
> > 
> > Can you explain a bit about the exact memory ordering that causes
> > MQ_RQ_IN_FLIGHT not to be observed?
> > 
> > blk-mq quiesce includes synchronize_rcu(), which drains all in-flight
> > dispatch, so after blk_mq_quiesce_queue() returns, if
> > blk_mq_queue_inflight() returns 0, it does mean there aren't any
> > active in-flight requests.
> > 
> > If there is a bug in this pattern, I guess more drivers may have
> > such a 'risk'.
> 
> What we know for sure is that there was a bad dm_target reference in
> (struct dm_rq_target_io *tio)->ti:
> 
> crash> struct -x dm_rq_target_io c00000245ca90128
> struct dm_rq_target_io {
>   md = 0xc0000031c66a4000,
>   ti = 0xc0080000020d0080 <fscache_object_list_lock+665632>,
> 
> crash> struct -x dm_target  0xc0080000020d0080
> struct dm_target struct: invalid kernel virtual address:
> c0080000020d0080  type: "gdb_readmem_callback"
> 
> The question is how this could have come to pass. It can only happen
> if tio->ti had been set before the map was reloaded. 
> My theory is that the IO had been dispatched before the queue had been
> quiesced, like this:
> 
> Task A                                 Task B
> (dispatching IO)                       (executing a DM_SUSPEND ioctl to
>                                        resume after DM_TABLE_LOAD)
>                                        do_resume()
>                                          dm_suspend()
>                                            __dm_suspend()
> dm_mq_queue_rq()                         
>    struct dm_target *ti =                
>        md->immutable_target;                  
>                                               dm_stop_queue()
>                                                  blk_mq_quiesce_queue()
>        /*
>         * At this point, the queue is quiesced, but task A
>         * has already entered dm_mq_queue_rq()
>         */

That shouldn't happen: blk_mq_quiesce_queue() drains all pending
dm_mq_queue_rq() calls and prevents new ones from being started.
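
To spell it out: dispatch runs inside an RCU read-side critical section
(SRCU for BLK_MQ_F_BLOCKING queues), and quiesce first marks the queue
as quiesced and then waits for every reader to finish. A simplified
sketch of that barrier, illustrating the structure rather than the
exact blk-mq code:

	/* dispatch side, conceptually what the hctx run path does */
	rcu_read_lock();
	if (!blk_queue_quiesced(hctx->queue))
		blk_mq_sched_dispatch_requests(hctx); /* may call dm_mq_queue_rq() */
	rcu_read_unlock();

	/* quiesce side, conceptually what blk_mq_quiesce_queue() does */
	blk_mq_quiesce_queue_nowait(q);	/* set QUEUE_FLAG_QUIESCED */
	synchronize_rcu();		/* wait for all readers above to exit */

So by the time dm_stop_queue() returns, Task A has either not entered
dm_mq_queue_rq() at all (it will see the quiesced flag when it is run
later) or has already returned from it; it can't still be sitting
between reading md->immutable_target and dm_start_request().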

>                                               dm_wait_for_completion()
>                                                  blk_mq_queue_inflight()
>        /* 
>         * blk_mq_queue_inflight() doesn't see Task A's 
>         * request because it isn't started yet
>         */
>                                               set_bit(dmf_suspended_flag)
>    dm_start_request(md, rq);             dm_swap_table()
>                                             __bind()
>                                               md->immutable_target = ...
>                                          dm_table_destroy()
>         /* the previous md->immutable_target is freed */
>    init_tio(tio, rq, md);
>         /* the stale ti pointer is assigned to tio->ti */
>    tio->ti = ti;
> 
> dm_mq_queue_rq() contains no synchronization code if
> md->immutable_target is set, so I think this can happen, even though
> it looks unlikely. With b4459b11e840 (which was not applied in the
> customer kernel), there would be a set_bit(DMF_BLOCK_IO_FOR_SUSPEND)
> statement before dm_stop_queue(), but IMO the above would still be
> possible.
> 
> If this can't happen, I have no more ideas how the observed situation
> came to pass. The customer who sent us the core claims that
> he has seen this multiple times already (but we have only this single
> core dump).
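
As an aside on the blk_mq_queue_inflight() part of your theory: it
indeed reports only requests that have left MQ_RQ_IDLE, i.e. anything
blk_mq_start_request() has started. Paraphrased from block/blk-mq.c
(the callback signature varies across kernel versions):

	static bool blk_mq_rq_inflight(struct request *rq, void *priv)
	{
		/* count any request that has been started */
		if (blk_mq_request_started(rq)) {
			bool *busy = priv;
			*busy = true;
			return false;	/* one is enough, stop iterating */
		}
		return true;
	}

	bool blk_mq_queue_inflight(struct request_queue *q)
	{
		bool busy = false;

		blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
		return busy;
	}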

Your kernel is v5.3, which is too old, so as Mike suggested, please
try b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
suspend"), given that blk-mq has seen a lot of changes here since then.

IMO, this issue shouldn't happen in an upstream kernel.

Thanks,
Ming




