On 11/27/24 17:58, Christoph Hellwig wrote:
> On Wed, Nov 27, 2024 at 05:17:08PM +0900, Damien Le Moal wrote:
>> After all these fixes, the last remaining problem is the zone write
>> plug error recovery issuing a report zone which can block if a queue
>> freeze was initiated.
>>
>> That can prevent forward progress and hang the freeze caller. I do not
>> see any way to avoid that report zones. I think this could be fixed with
>> a magic BLK_MQ_REQ_INTERNAL flag passed to blk_mq_alloc_request() and
>> propagated to blk_queue_enter() to forcefully take a queue usage counter
>> reference even if a queue freeze was started. That would ensure forward
>> progress (i.e. scsi_execute_cmd() or the NVMe equivalent would not block
>> forever). Need to think more about that.
>
> You are talking about disk_zone_wplug_handle_error here, right?

Yes.

> We should not issue a report zones to a frozen queue, as that would
> bypass the freezing protection. I suspect the right thing is to
> simply defer the error recovery action until after the queue is
> unfrozen.

But that is the issue: if we defer the report zones, we cannot make
progress with BIOs still plugged in the zone write plug BIO list. These
hold a queue usage reference that the queue freeze wait is waiting for.
We have to somehow allow that report zones to execute to make progress
and empty the zone write plugs of all plugged BIOs.

Note that if we were talking about regular writes only, we would not need
to care about error recovery as we would simply need to abort all these
plugged BIOs (as we know they will fail anyway). But for a correct zone
append emulation, we need to recover the zone write pointer to resume the
execution of the plugged BIOs. Otherwise, the user would see failed zone
append commands that are not supposed to fail unless the drive (or the
zone) is dead...

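To make the dependency concrete, here is a toy userspace model of it. This
is only a sketch, not kernel code: struct toy_queue, TOY_REQ_INTERNAL and
all the toy_*() helpers are made-up names, and the "block" in
blk_queue_enter() is modeled as a simple failure return. It assumes the
BLK_MQ_REQ_INTERNAL idea would behave as described above.

```c
#include <stdbool.h>

/* Toy model only. All names are hypothetical, nothing here is kernel API. */
struct toy_queue {
	int usage;	/* stands in for the queue usage counter */
	bool freezing;	/* a freeze started and waits for usage == 0 */
};

/* Models the proposed BLK_MQ_REQ_INTERNAL flag. */
#define TOY_REQ_INTERNAL 0x1

/*
 * Models blk_queue_enter(): returns true if a usage reference was taken,
 * false where the real code would block until the queue is unfrozen.
 */
static bool toy_queue_enter(struct toy_queue *q, unsigned int flags)
{
	if (q->freezing && !(flags & TOY_REQ_INTERNAL))
		return false;
	q->usage++;
	return true;
}

static void toy_queue_exit(struct toy_queue *q)
{
	q->usage--;
}

/* The freeze completes only once every usage reference is dropped. */
static bool toy_freeze_done(const struct toy_queue *q)
{
	return q->freezing && q->usage == 0;
}

/*
 * Error recovery for one zone: execute a report zones, then let the
 * plugged BIOs complete, each dropping the reference it holds.
 * Returns false if the report zones could not enter the queue.
 */
static bool toy_recover_zone(struct toy_queue *q, int plugged_bios,
			     unsigned int flags)
{
	if (!toy_queue_enter(q, flags))
		return false;	/* report zones stuck: no progress */
	toy_queue_exit(q);	/* report zones completed */
	while (plugged_bios-- > 0)
		toy_queue_exit(q);
	return true;
}
```

With usage = 3 (three plugged BIOs) and freezing = true, calling
toy_recover_zone(&q, 3, 0) fails and toy_freeze_done() never becomes
true, which is the hang. Passing TOY_REQ_INTERNAL lets the report zones
run, the plugged BIOs drain and the freeze can then complete.
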
> I wonder if the separate error work handler should go away, instead
> blk_zone_wplug_bio_work should always check for an error first
> and in that case do the report zones. And blk_zone_wplug_handle_write
> would always defer to the work queue if there was an error.

That would not change a thing. The issue is that if a queue freeze has
started, executing a report zones can block on a request allocation (on
the blk_queue_enter() it implies if there are no cached requests). So the
same problem remains. Though I agree that the error recovery could be
moved to the zone BIO work and we could get rid of the error recovery
work.

But we still need to somehow allow that report zones to execute even if a
queue freeze has started... Hence the idea of the BLK_MQ_REQ_INTERNAL
flag to allow that, for special cases like this one where completing BIOs
depends on first executing another internal command. Or maybe we could
try to pre-allocate a request for such a case, but managing that request
so that it is not freed and can be reused until all errors are processed
may need many block layer changes...

-- 
Damien Le Moal
Western Digital Research