On 10/17/22 07:21, Bart Van Assche wrote:
> On 10/15/22 22:20, Chaitanya Kulkarni wrote:
>> In current timeout implementation null_blk just completes the request
>> with error=BLK_STS_TIMEOUT without doing any cleanup, hence device
>> cleanup code including handling inflight requests on timeout and
>> teardown is never exercised.
>
> Hi Chaitanya,
>
> How about removing that code instead of adding a mechanism for
> triggering it?
>

Can you please elaborate on this? Which code needs to be removed?

>> Add a module parameter rq_abort_limit to allow null_blk perform device
>> cleanup when time out occurs. The non zero value of this parameter
>> allows user to set the number of timeouts to occur before triggering
>> cleanup/teardown work.
>
> As Ming Lei wrote, there are no other block drivers that destroy
> themselves if a certain number of timeouts occur. It seems weird to me
> to trigger self-removal from inside a timeout handler.
>

Ming thought I was proposing device removal as the first line of action
in the timeout callback, without first checking whether the request can
be aborted and the device made functional again. I am not: the new
module parameter lets the user set how many requests must time out
before the teardown sequence starts.

The nvme-rdma host (and, I guess, the nvme-tcp host) has similar
behavior: it can remove the device from the err_work issued from the
request timeout callback. From nvme/host/rdma.c:

nvme_rdma_timeout()
 nvme_rdma_error_recovery()
  nvme_err_work() -> nvme_reset_wq
   nvme_rdma_error_recovery_work()
    ...
    nvme_rdma_tear_down_io_queues()
     nvme_start_freeze()
      blk_freeze_queue_start()
     nvme_stop_queues()
      nvme_stop_ns_queue()
       blk_mq_quiesce_queue() or blk_mq_wait_quiesce_done()
     nvme_sync_io_queues()
      blk_sync_queue()
     nvme_start_queues()
      nvme_start_ns_queue()
       blk_mq_unquiesce_queue()
    nvme_rdma_reconnect_or_remove()

Also, I've listed the problems I've seen first hand when a device that
is non-responsive due to request timeouts is kept in the system. In that
case we should let the user decide whether to remove or keep the device,
instead of forcing the user to keep a device that brings down the whole
system. These problems are really hard to debug, even with Teledyne
LeCroy [1].

This patch follows the same philosophy: the user can opt in to removal
with a module parameter. Once opted in, the user knows what they are
getting into.

-ck

[1] https://teledynelecroy.com/protocolanalyzer/pci-express/interposers-and-probes
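
To make the opt-in semantics of rq_abort_limit concrete, here is a rough
userspace model of the counting described above. It is only a sketch and
not the patch code: struct model_dev, model_timeout() and model_run() are
hypothetical names, and only rq_abort_limit comes from the patch
description. It also assumes the count is cumulative over the device
lifetime and that teardown is triggered once, by the timeout that reaches
the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device; not a null_blk structure. */
struct model_dev {
	unsigned int rq_abort_limit;	/* 0 means opt-out: never tear down */
	unsigned int nr_timeouts;	/* timeouts observed so far (assumed cumulative) */
	bool teardown_scheduled;	/* set once, when the limit is reached */
};

/*
 * Called once per timed-out request. Returns true only for the timeout
 * that reaches the limit and therefore schedules teardown.
 */
bool model_timeout(struct model_dev *dev)
{
	assert(dev);

	dev->nr_timeouts++;

	/* Default behavior: user did not opt in, or teardown already queued. */
	if (dev->rq_abort_limit == 0 || dev->teardown_scheduled)
		return false;

	if (dev->nr_timeouts >= dev->rq_abort_limit) {
		dev->teardown_scheduled = true;
		return true;
	}

	return false;
}

/*
 * Feed n timeouts into a fresh device with the given limit. Returns the
 * 1-based index of the timeout that triggered teardown, or 0 if none did.
 */
unsigned int model_run(unsigned int limit, unsigned int n)
{
	struct model_dev dev = { .rq_abort_limit = limit };
	unsigned int i;

	for (i = 1; i <= n; i++)
		if (model_timeout(&dev))
			return i;

	return 0;
}
```

With a limit of 0 the model never schedules teardown, which corresponds
to the existing behavior of completing the request with BLK_STS_TIMEOUT
and nothing else. With a non-zero limit, teardown is scheduled only after
that many requests have timed out, not on the first one.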