On 2/1/24 08:41, Bart Van Assche wrote:
> On 1/31/24 15:04, Keith Busch wrote:
>> I didn't have anything in mind; just that protocols don't require all
>> commands be fast.
>
> The default block layer timeout is 30 seconds because typical storage
> commands complete in much less than 30 seconds.
>
>> NVMe has wait event commands that might not ever complete.
>
> Are you perhaps referring to the NVMe Asynchronous Event Request
> command? That command doesn't count because the command ID for that
> command comes from another set than I/O commands. From the NVMe
> driver:
>
> static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
> {
> 	return !qid &&
> 		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
> }
>
>> A copy command requesting multiple terabytes won't be quick for even
>> the fastest hardware (not "hours", but not fast).
>
> Is there any setup in which such large commands are submitted? Write
> commands that last long may negatively affect read latency. This is a
> good reason not to make the max_sectors value too large.

Even if max_sectors is not very large, if the device has a gigantic
write cache that needs to be flushed first to be able to process an
incoming write, then writes can be slow. I have seen issues in the
field with that causing timeouts.

An even worse case: HDDs doing on-media caching of writes when the
volatile write cache is disabled by the user. If the on-media write
cache then needs to be freed up to accept a new write, the HDD will be
very, very slow handling writes.

There are plenty of scenarios out there where a device can suddenly
become slow, hogging a lot of tags in the process.

>> If hardware stops responding, the tags are locked up for as long as it
>> takes recovery escalation to reclaim them. For nvme, error recovery
>> could take over a minute by default.
>
> If hardware stops responding, who cares about fairness of tag
> allocation since this means that request processing halts for all
> queues associated with the controller that locked up?

Considering the above, it is more about a horrible slowdown of all the
devices sharing a tagset because one of the devices is slow for
whatever reason.

Note: this is only my 2 cents input. I have not seen any issue in
practice with shared tagsets, but I do not think I have ever
encountered a system actually using that feature :)

--
Damien Le Moal
Western Digital Research
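
For context, "shared tagset" above refers to the blk-mq pattern in
which several request queues are created from a single struct
blk_mq_tag_set, so requests for all of the devices compete for one pool
of tags. Below is a minimal sketch of the pattern: the blk-mq calls are
the real kernel API, but the "example" driver structure, its names, the
queue depth, and the queue_rq stub are hypothetical, and error
unwinding (blk_mq_free_tag_set and friends) is omitted.

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

struct example_ctrl {
	struct blk_mq_tag_set tag_set;
	struct request_queue *queues[4];
};

/* Stub dispatch handler: immediately complete every request. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	blk_mq_start_request(bd->rq);
	blk_mq_end_request(bd->rq, BLK_STS_OK);
	return BLK_STS_OK;
}

static const struct blk_mq_ops example_mq_ops = {
	.queue_rq = example_queue_rq,
};

static int example_probe(struct example_ctrl *ctrl)
{
	struct blk_mq_tag_set *set = &ctrl->tag_set;
	int i, ret;

	set->ops = &example_mq_ops;
	set->nr_hw_queues = 1;
	set->queue_depth = 64;	/* one pool of 64 tags for everything */
	set->numa_node = NUMA_NO_NODE;

	ret = blk_mq_alloc_tag_set(set);
	if (ret)
		return ret;

	/*
	 * All queues created from the same tag_set share its tags: a
	 * request stuck on one device holds a tag that none of the
	 * other devices can use until it completes or times out.
	 */
	for (i = 0; i < ARRAY_SIZE(ctrl->queues); i++) {
		ctrl->queues[i] = blk_mq_init_queue(set);
		if (IS_ERR(ctrl->queues[i]))
			return PTR_ERR(ctrl->queues[i]);
	}

	return 0;
}

With that layout, the slow-device scenarios described above starve the
fast devices directly: tags held by commands waiting on, for example, a
write cache flush are simply unavailable to the other queues until
those commands complete or time out.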