Re: [blktests] zbd/012: Test requeuing of zoned writes and queue freezing

Damien Le Moal <dlemoal@xxxxxxxxxx> · Wed, 27 Nov 2024 20:31:43 +0900

On 11/27/24 17:58, Christoph Hellwig wrote:
> On Wed, Nov 27, 2024 at 05:17:08PM +0900, Damien Le Moal wrote:
>> After all these fixes, the last remaining problem is the zone write
>> plug error recovery issuing a report zone which can block if a queue 
>> freeze was initiated.
>>
>> That can prevent forward progress and hang the freeze caller. I do not
>> see any way to avoid that report zones. I think this could be fixed with
>> a magic BLK_MQ_REQ_INTERNAL flag passed to blk_mq_alloc_request() and
>> propagated to blk_queue_enter() to forcefully take a queue usage counter
>> reference even if a queue freeze was started. That would ensure forward
>> progress (i.e. scsi_execute_cmd() or the NVMe equivalent would not block
>> forever). Need to think more about that.
> 
> You are talking about disk_zone_wplug_handle_error here, right?

Yes.

> We should not issue a report zones to a frozen queue, as that would
> bypass the freezing protection.  I suspect the right thing is to
> simply defer the error recovery action until after the queue is
> unfrozen.

But that is the issue: if we defer the report zones, we cannot make progress
with BIOs still plugged in the zone write plug BIO list. These hold a queue
usage reference that the queue freeze wait is waiting for. We have to somehow
allow that report zones to execute to make progress and empty the zone write
plugs of all plugged BIOs.
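
To make the dependency cycle concrete, here is a small userspace toy model. It
is only a sketch, not kernel code: every toy_* name is hypothetical, and the
model assumes that each plugged BIO holds one usage reference and that
entering the queue simply fails (instead of sleeping) once a freeze has
started.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_queue {
	int  usage;       /* models the queue usage counter */
	bool freezing;    /* a freeze started and waits for usage == 0 */
	int  plugged;     /* BIOs parked in a zone write plug, one ref each */
	bool wp_unknown;  /* write pointer must be recovered (report zones) */
};

/* Models blk_queue_enter(): refused (would block) once a freeze started. */
static bool toy_queue_enter(struct toy_queue *q)
{
	if (q->freezing)
		return false;
	q->usage++;
	return true;
}

static void toy_queue_exit(struct toy_queue *q)
{
	assert(q->usage > 0);
	q->usage--;
}

/* A plugged BIO keeps its usage reference until it is issued. */
static bool toy_plug_bio(struct toy_queue *q)
{
	if (!toy_queue_enter(q))
		return false;
	q->plugged++;
	return true;
}

/* Error recovery needs a new request, hence a new queue entry. */
static bool toy_recover_wp(struct toy_queue *q)
{
	if (!toy_queue_enter(q))
		return false;	/* the real code would block here */
	q->wp_unknown = false;
	toy_queue_exit(q);
	return true;
}

/* Plugged BIOs can only be issued once the write pointer is known. */
static int toy_drain(struct toy_queue *q)
{
	int done = 0;

	if (q->wp_unknown && !toy_recover_wp(q))
		return 0;
	while (q->plugged > 0) {
		q->plugged--;
		toy_queue_exit(q);
		done++;
	}
	return done;
}

static bool toy_freeze_done(const struct toy_queue *q)
{
	return q->freezing && q->usage == 0;
}
```

In this model, plugging two BIOs, marking the write pointer as unknown and
then starting a freeze leaves toy_drain() unable to issue anything, so the
usage count never drops to zero and toy_freeze_done() never becomes true.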

Note that if we were talking about regular writes only, we would not need to
care about error recovery as we would simply need to abort all these plugged
BIOs (as we know they will fail anyway). But for a correct zone append
emulation, we need to recover the zone write pointer to resume the execution of
the plugged BIOs. Otherwise, the user would see failed zone append commands that
are not supposed to fail unless the drive (or the zone) is dead...

> I wonder if the separate error work handler should go away, instead
> blk_zone_wplug_bio_work should always check for an error first
> and in that case do the report zones.  And blk_zone_wplug_handle_write
> would always defer to the work queue if there was an error.

That would not change a thing. The issue is that if a queue freeze has started,
executing a report zones can block on the request allocation (that is, on the
blk_queue_enter() call it implies if there are no cached requests). So the same
problem remains.

Though I agree that the error recovery could be moved to the zone BIO work and
we could get rid of the error recovery work.

But we still need to somehow allow that report zones to execute even if a queue
freeze has started... Hence the idea of the BLK_MQ_REQ_INTERNAL flag to allow
that, for special cases like this one where completing BIOs depends on first
executing another internal command. Or maybe we could try to pre-allocate a
request for such cases, but keeping that request from being freed so that it can
be reused until all errors are processed may need many block layer changes...
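
As a rough illustration of the flag idea, here is a userspace toy sketch. It
is not a patch: TOY_REQ_INTERNAL only stands in for the proposed
BLK_MQ_REQ_INTERNAL (which does not exist today), and the toy_* names are
hypothetical. The only point is that an internal entry is still granted after
a freeze has started, while a regular entry is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the proposed BLK_MQ_REQ_INTERNAL flag. */
#define TOY_REQ_INTERNAL (1u << 0)

/* Toy model only: none of these names exist in the kernel. */
struct toy_queue {
	int  usage;     /* models the queue usage counter */
	bool freezing;  /* a freeze started and waits for usage == 0 */
};

enum toy_enter { TOY_ENTERED, TOY_WOULD_BLOCK };

/*
 * Models blk_queue_enter() with the flag propagated down from the request
 * allocation: a regular caller is held off once a freeze has started, an
 * internal caller still takes a usage reference.
 */
static enum toy_enter toy_queue_enter(struct toy_queue *q, unsigned int flags)
{
	if (q->freezing && !(flags & TOY_REQ_INTERNAL))
		return TOY_WOULD_BLOCK;
	q->usage++;
	return TOY_ENTERED;
}

static void toy_queue_exit(struct toy_queue *q)
{
	assert(q->usage > 0);
	q->usage--;
}
```

With the queue freezing and one reference still held by a plugged BIO, a
regular toy_queue_enter() reports that it would block, while the internal one
succeeds, so the report zones could run and the plugged BIOs could then drop
their references.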

-- 
Damien Le Moal
Western Digital Research