On 2020/02/19 10:32, Ming Lei wrote:
> On Wed, Feb 19, 2020 at 02:41:14AM +0900, Keith Busch wrote:
>> On Tue, Feb 18, 2020 at 10:54:54AM -0500, Tim Walker wrote:
>>> With regards to our discussion on queue depths, it's common knowledge
>>> that an HDD chooses commands from its internal command queue to
>>> optimize performance. The HDD looks at things like the current
>>> actuator position, current media rotational position, power
>>> constraints, command age, etc. to choose the best next command to
>>> service. A large number of commands in the queue gives the HDD a
>>> better selection of commands from which to choose to maximize
>>> throughput/IOPS/etc., but at the expense of the added latency due to
>>> commands sitting in the queue.
>>>
>>> NVMe doesn't allow us to pull commands randomly from the SQ, so the
>>> HDD should attempt to fill its internal queue from the various SQs,
>>> according to the SQ servicing policy, so it can have a large number
>>> of commands to choose from for its internal command processing
>>> optimization.
>>
>> You don't need multiple queues for that. While the device has to
>> fetch commands from a host's submission queue in FIFO order, it may
>> reorder their execution and completion however it wants, which you
>> can do with a single queue.
>>
>>> It seems to me that the host would want to limit the total number of
>>> outstanding commands to an NVMe HDD
>>
>> The host shouldn't have to decide on limits. NVMe lets the device
>> report its queue count and depth. It should be the device's
>> responsibility to
>
> Will NVMe HDD support multiple NS? If yes, this queue depth isn't
> enough, given that all NSs share this single host queue depth.
>
>> report appropriate values that maximize iops within your latency
>> limits, and the host will react accordingly.
>
> Suppose the NVMe HDD just wants to support a single NS and there is a
> single queue: if the device only reports one host queue depth, block
> layer IO sort/merge can only be done when device saturation feedback
> is provided.
>
> So it looks like either a per-NS queue depth or a per-NS device
> saturation feedback mechanism is needed, otherwise the NVMe HDD may
> have to do internal IO sort/merge.

SAS and SATA HDDs today already do a lot of internal IO reordering and
merging. That is partly why, even with "none" set as the scheduler, you
can see IOPS increase as the queue depth (QD) grows.

But yes, I think you do have a point with the saturation feedback. This
may be necessary for better host-side scheduling.

> Thanks,
> Ming

--
Damien Le Moal
Western Digital Research
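
As background to the point above that "NVMe lets the device report its
queue count and depth": a minimal sketch, not from this thread or from
the kernel driver, of how a host derives the advertised per-queue depth
from the controller's CAP register (bits 15:0, MQES, a zero's based
value per the NVMe base specification). The helper name and the "regs"
pointer are illustrative assumptions.

#include <stdint.h>

/* Offset of the 64-bit Controller Capabilities (CAP) register in BAR0. */
#define NVME_REG_CAP        0x0000
/* CAP bits 15:0: Maximum Queue Entries Supported (MQES), zero's based. */
#define NVME_CAP_MQES(cap)  ((uint32_t)((cap) & 0xffff))

/*
 * Return the deepest SQ/CQ the controller supports, given a pointer to
 * its mapped register block ("regs" is assumed valid and mapped).
 */
static uint32_t nvme_max_queue_depth(const volatile uint8_t *regs)
{
	uint64_t cap = *(const volatile uint64_t *)(regs + NVME_REG_CAP);

	/* MQES == 1023 means each queue may hold 1024 entries. */
	return NVME_CAP_MQES(cap) + 1;
}

The number of I/O queues themselves is negotiated separately via the
Number of Queues feature (Set Features, FID 07h), which is where a
controller can cap how many SQs the host creates.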