Re: [blktests] zbd/012: Test requeuing of zoned writes and queue freezing

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 26 Nov 2024 22:16:18 -0800

On Wed, Nov 27, 2024 at 02:18:34PM +0900, Damien Le Moal wrote:
> After some debugging, I understood the issue. It is not a problem with the
> queue usage counter but rather an issue with REQ_NOWAIT BIOs that may be failed
> with bio_wouldblock_error() *after* having been processed by
> blk_zone_plug_bio(). E.g. blk_mq_get_new_requests() may fail to get a request
> due to REQ_NOWAIT being set and fail the BIO using bio_wouldblock_error(). This
> in turn will lead to a call to disk_zone_wplug_set_error() which will mark the
> zone write plug with the ERROR flag. However, since the BIO failure is not from
> a failed request, we are not calling disk_zone_wplug_unplug_bio(), which if
> called would run the error recovery for the zone. That is, the zone write plug
> of the BLK_STS_AGAIN failed BIO is left "busy", which results in the BIO being
> added to the plug BIO list when it is re-submitted. But since we do not have any
> write BIO on-going for this zone, the BIO ends up stuck in the zone write plug
> list holding a queue reference count, which causes the freeze to never terminate.

Did you trace where the bio_wouldblock_error() is coming from?  Probably
a failing request allocation?  Can we call the guts of blk_zone_plug_bio()
after allocating the request to avoid this?
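
To make the ordering question concrete, here is a toy user-space model of the
sequence described in the quoted report, next to the reordering asked about
above. It is only a sketch: every name in it (toy_queue, toy_zone_plug,
submit_plug_first(), submit_alloc_first(), toy_is_stuck()) is hypothetical and
none of it is kernel code. The single "busy" flag stands in for the zone write
plug state that the report says is left set, and "tags_available" stands in for
whether a request can be allocated without blocking. Whether the real submission
path can be reordered this way is exactly the open question, not something this
model demonstrates.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types and functions are hypothetical, not kernel APIs. */
enum toy_status { TOY_SUBMITTED, TOY_AGAIN, TOY_QUEUED_IN_PLUG };

struct toy_zone_plug {
	bool busy;        /* stands in for the plug state left set after the failure */
	int queued_bios;  /* BIOs parked in the plug list */
	int inflight;     /* writes actually issued for this zone */
};

struct toy_queue {
	int usage_refs;       /* stands in for the queue usage counter */
	bool tags_available;  /* can a request be allocated without blocking? */
};

/* Ordering from the report: zone plugging first, request allocation second. */
static enum toy_status submit_plug_first(struct toy_queue *q,
					 struct toy_zone_plug *p)
{
	q->usage_refs++;
	if (p->busy) {
		/* Parked until an in-flight write completes; keeps its reference. */
		p->queued_bios++;
		return TOY_QUEUED_IN_PLUG;
	}
	p->busy = true;
	if (!q->tags_available) {
		/* NOWAIT-style failure: the BIO is failed, but busy stays set. */
		q->usage_refs--;
		return TOY_AGAIN;
	}
	p->inflight++;
	return TOY_SUBMITTED;
}

/* Assumed alternative: allocate first, touch the plug only on success. */
static enum toy_status submit_alloc_first(struct toy_queue *q,
					  struct toy_zone_plug *p)
{
	q->usage_refs++;
	if (!q->tags_available) {
		/* Fails before any plug state has been modified. */
		q->usage_refs--;
		return TOY_AGAIN;
	}
	if (p->busy) {
		p->queued_bios++;
		return TOY_QUEUED_IN_PLUG;
	}
	p->busy = true;
	p->inflight++;
	return TOY_SUBMITTED;
}

/* A parked BIO with no in-flight write will never be unplugged. */
static bool toy_is_stuck(const struct toy_zone_plug *p)
{
	return p->queued_bios > 0 && p->inflight == 0;
}
```

In this model, a failed NOWAIT-style submission followed by a retry leaves
submit_plug_first() with one parked BIO, no in-flight write, and one queue
reference that is never dropped, which is the shape of the hang in the report.
The same two calls through submit_alloc_first() fail cleanly and then issue
the write.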