In a recent kernel dump analysis, we found that the kernel crashed
because the dm_rq_target_io member tio->ti was pointing to invalid
memory in dm_end_request(), in a situation where multipathd was doing
map reloads because of a storage failover. The map of the respective
mapped_device had been replaced by a different struct dm_table.

We observed this with a 5.3.18 distro kernel, but the code in question
hasn't changed much since then. Basically, we were only missing commit
b4459b11e840 ("dm rq: don't queue request to blk-mq during DM
suspend"), and even that commit doesn't guarantee that the race
described below can't happen.

When a map is resumed after a table reload, the live table is swapped,
and the tio->ti member of any live request becomes stale. The resume
code path avoids this by suspending the device first: __dm_suspend()
quiesces the queue and calls dm_wait_for_completion(), which waits
until blk_mq_queue_inflight() doesn't report any in-flight requests.
However, blk_mq_queue_inflight() counts only "started" requests. So,
if a request is dispatched before the queue is quiesced, but
dm_wait_for_completion() doesn't observe MQ_RQ_IN_FLIGHT for this
request because of memory ordering effects, __dm_suspend() may finish
successfully even though there is an active request, and the resume
code path may proceed through dm_swap_table() and dm_table_destroy().
The request that sneaked through will then carry an invalid dm_target
reference in tio->ti. This is how I believe the above crash came to
pass.

Fix this by setting the "started" status of the request inside an SRCU
read-side critical section. In __dm_suspend(),
synchronize_srcu(&md->io_barrier) is called after setting
DMF_BLOCK_IO_FOR_SUSPEND. The read-side critical section is executed
completely, with all its memory effects, either before or after the
synchronize_srcu(). In the first case, dm_wait_for_completion() will
observe the in-flight status. Otherwise, dm_mq_queue_rq() will see
DMF_BLOCK_IO_FOR_SUSPEND and give up. In both cases, there will be no
stale reference in tio->ti when the request finishes.

Signed-off-by: Martin Wilck <mwilck@xxxxxxxx>
Cc: Alasdair G Kergon <agk@xxxxxxxxxx>
Cc: Mike Snitzer <snitzer@xxxxxxxxxx>
Cc: Mikulas Patocka <mpatocka@xxxxxxxxxx>
Cc: Ming Lei <ming.lei@xxxxxxxxxx>
Cc: Hannes Reinecke <hare@xxxxxxx>
Cc: Vasilis Liaskovitis <vliaskovitis@xxxxxxxx>
---
 drivers/md/dm-rq.c | 51 +++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..7f6c83eb2e5b 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -481,6 +481,10 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
 	struct dm_target *ti = md->immutable_target;
+	struct dm_table *map;
+	int srcu_idx;
+
+	map = dm_get_live_table(md, &srcu_idx);
 
 	/*
 	 * blk-mq's unquiesce may come from outside events, such as
@@ -488,25 +492,31 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	 * come during suspend, so simply ask for blk-mq to requeue it.
 	 */
 	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)))
-		return BLK_STS_RESOURCE;
+		goto out_put_table;
+
+	/*
+	 * If this request got dispatched before the queue was stopped in
+	 * __dm_suspend(), make sure dm_wait_for_completion() will
+	 * see this request as in flight. Otherwise this request may actually
+	 * be serviced while the table is swapped, and when it finishes,
+	 * tio->ti may reference a stale target.
+	 * As we're in a read-side critical section here, the synchronize_srcu()
+	 * in __dm_suspend() guarantees that if we haven't seen
+	 * DMF_BLOCK_IO_FOR_SUSPEND here, __dm_suspend() will observe the
+	 * in flight status.
+	 */
+	dm_start_request(md, rq);
 
 	if (unlikely(!ti)) {
-		int srcu_idx;
-		struct dm_table *map;
-
-		map = dm_get_live_table(md, &srcu_idx);
-		if (unlikely(!map)) {
-			dm_put_live_table(md, srcu_idx);
-			return BLK_STS_RESOURCE;
-		}
+		if (unlikely(!map))
+			goto out_put_table;
 		ti = dm_table_find_target(map, 0);
-		dm_put_live_table(md, srcu_idx);
 	}
+	dm_put_live_table(md, srcu_idx);
 
 	if (ti->type->busy && ti->type->busy(ti))
-		return BLK_STS_RESOURCE;
+		goto out_requeue;
 
-	dm_start_request(md, rq);
 
 	/* Init tio using md established in .init_request */
 	init_tio(tio, rq, md);
@@ -517,14 +527,19 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 	tio->ti = ti;
 
 	/* Direct call is fine since .queue_rq allows allocations */
-	if (map_request(tio) == DM_MAPIO_REQUEUE) {
-		/* Undo dm_start_request() before requeuing */
-		rq_end_stats(md, rq);
-		rq_completed(md);
-		return BLK_STS_RESOURCE;
-	}
+	if (map_request(tio) == DM_MAPIO_REQUEUE)
+		goto out_requeue;
 
 	return BLK_STS_OK;
+
+out_put_table:
+	dm_put_live_table(md, srcu_idx);
+
+out_requeue:
+	/* Undo dm_start_request() before requeuing */
+	rq_end_stats(md, rq);
+	rq_completed(md);
+	return BLK_STS_RESOURCE;
 }
 
 static const struct blk_mq_ops dm_mq_ops = {
-- 
2.43.2
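
For reference, a minimal sketch of the ordering argument the patch
relies on. This is not part of the patch and is not meant to be
applied: block_io and in_flight are illustrative stand-ins for
DMF_BLOCK_IO_FOR_SUSPEND and the "started" state that
blk_mq_queue_inflight() counts, and dispatch()/suspend() are
hypothetical simplifications of dm_mq_queue_rq() and __dm_suspend().

/* Illustrative sketch only -- not part of the patch. */
#include <linux/srcu.h>
#include <linux/compiler.h>
#include <linux/processor.h>

DEFINE_SRCU(io_barrier);	/* stands in for md->io_barrier */

static bool block_io;		/* DMF_BLOCK_IO_FOR_SUSPEND */
static bool in_flight;		/* "started" / MQ_RQ_IN_FLIGHT */

/* Dispatch side, cf. dm_mq_queue_rq() */
static bool dispatch(void)
{
	int idx = srcu_read_lock(&io_barrier);
	bool started = false;

	if (!READ_ONCE(block_io)) {
		/* cf. dm_start_request(): mark the request in flight */
		WRITE_ONCE(in_flight, true);
		started = true;
	}
	srcu_read_unlock(&io_barrier, idx);
	return started;		/* false => requeue, cf. BLK_STS_RESOURCE */
}

/* Suspend side, cf. __dm_suspend() */
static void suspend(void)
{
	WRITE_ONCE(block_io, true);

	/*
	 * Every dispatch() read-side critical section either completed
	 * before this returns (so its in_flight store is visible below)
	 * or started after it (so it saw block_io == true and gave up).
	 */
	synchronize_srcu(&io_barrier);

	/* cf. dm_wait_for_completion() / blk_mq_queue_inflight() */
	while (READ_ONCE(in_flight))
		cpu_relax();	/* the real code sleeps instead of spinning */
}

The goto out_put_table/out_requeue rearrangement in the patch is what
moves the dm_start_request() call (the in_flight store above) inside
the read-side critical section in the real code, so that the two cases
in suspend() are the only possible outcomes.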