Ming,

On 2020/02/13 7:03, Ming Lei wrote:
> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>> On 2020/02/12 4:01, Tim Walker wrote:
>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>> Background:
>>>>>
>>>>> NVMe specification has hardened over the decade and now NVMe devices
>>>>> are well integrated into our customers' systems. As we look forward,
>>>>> moving HDDs to the NVMe command set eliminates the SAS IOC and driver
>>>>> stack, consolidating on a single access method for rotational and
>>>>> static storage technologies. PCIe-NVMe offers near-SATA interface
>>>>> costs, features and performance suitable for high-cap HDDs, and
>>>>> optimal interoperability for storage automation, tiering, and
>>>>> management. We will share some early conceptual results and proposed
>>>>> salient design goals and challenges surrounding an NVMe HDD.
>>>>
>>>> HDD performance is very sensitive to IO order. Could you provide some
>>>> background info about NVMe HDD? Such as:
>>>>
>>>> - number of hw queues
>>>> - hw queue depth
>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>
>>>>> Discussion Proposal:
>>>>>
>>>>> We'd like to share our views and solicit input on:
>>>>>
>>>>> - What Linux storage stack assumptions do we need to be aware of as we
>>>>> develop these devices with drastically different performance
>>>>> characteristics than traditional NAND? For example, what scheduler or
>>>>> device driver level changes will be needed to integrate NVMe HDDs?
>>>>
>>>> IO merge is often important for HDD. IO merge is usually triggered when
>>>> .queue_rq() returns STS_RESOURCE, and so far this condition won't be
>>>> triggered for NVMe SSD.
>>>>
>>>> Also blk-mq kills BDI queue congestion and ioc batching, and causes
>>>> writeback performance regression [1][2].
>>>>
>>>> What I am thinking is whether we need to switch to an independent IO
>>>> path for handling SSD and HDD IO, given the two mediums are so
>>>> different from a performance viewpoint.
>>>>
>>>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
>>>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
>>>>
>>>> Thanks,
>>>> Ming
>>>>
>>> I would expect the drive would support a reasonable number of queues
>>> and a relatively deep queue depth, more in line with NVMe practices
>>> than SAS HDDs' typical 128. But it probably doesn't make sense to
>>> queue up thousands of commands on something as slow as an HDD, and
>>> many customers keep queues < 32 for latency management.
>>
>> Exposing an HDD through multiple queues, each with a high queue depth,
>> is simply asking for trouble. Commands will end up spending so much time
>> sitting in the queues that they will time out. This can already be
>> observed with the smartpqi SAS HBA, which exposes single drives as
>> multiqueue block devices with a high queue depth. Exercising these
>> drives heavily leads to thousands of commands being queued and to
>> timeouts. It is fairly easy to trigger this without a manual change to
>> the QD. This has been on my to-do list of fixes for some time now
>> (lacking time to do it).
>
> Just wondering why smartpqi SAS won't set one proper .cmd_per_lun to
> avoid the issue. It looks like the driver simply assigns .can_queue to
> it, so it isn't strange to see the timeout issue. If .can_queue is a bit
> big, the HDD is easily saturated for too long.
>
>> NVMe HDDs need to have an interface setup that matches their speed, that
>> is, something like a SAS interface: *single* queue pair with a max QD of
>> 256 or less depending on what the drive can take. There is no
>> TASK_SET_FULL notification on NVMe, so throttling has to come from the
>> max QD of the SQ, which the drive will advertise to the host.
>>
>>> Merge and elevator are important to HDD performance. I don't believe
>>> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
>>> within a SQ without driving large differences between SSD & HDD data
>>> paths?
>>
>> As far as I know, there is no merging going on once requests are passed
>> to the driver and added to an SQ. So this is beside the point.
>> The current default block scheduler for NVMe SSDs is "none". This is
>> decided based on the number of queues of the device. NVMe drives that
>> have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in
>> their request queue can fall back to the default spinning rust
>> mq-deadline elevator. That will achieve the command merging and LBA
>> ordering needed for good performance on HDDs.
>
> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from
> .queue_rq(), or if blk_mq_get_dispatch_budget() always returns true.
> NVMe's .queue_rq() basically always returns STS_OK.

I am confused: when an elevator is set, ->queue_rq() is called for requests
obtained from the elevator (with e->type->ops.dispatch_request()), after the
requests went through it. And merging will happen at that stage, when new
requests are inserted in the elevator. If ->queue_rq() returns
BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the request is indeed requeued,
which offers more chances of further merging, but that is not the same as no
merging happening. Am I missing your point here?
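To make sure we are looking at the same path, this is roughly the scheduler
dispatch side I have in mind, condensed from blk_mq_do_dispatch_sched() in
block/blk-mq-sched.c (paraphrased and trimmed from memory, so please treat
it as a sketch of the flow rather than the exact current code):

static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);

        do {
                struct request *rq;

                if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
                        break;

                if (!blk_mq_get_dispatch_budget(hctx))
                        break;

                /*
                 * Requests handed out here already went through the
                 * elevator, so bio/request merging already had its chance
                 * at insertion time.
                 */
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(hctx);
                        break;
                }

                list_add(&rq->queuelist, &rq_list);

                /*
                 * blk_mq_dispatch_rq_list() is what ends up calling
                 * ->queue_rq(); a BLK_STS_RESOURCE/BLK_STS_DEV_RESOURCE
                 * return there requeues the request instead of losing it.
                 */
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
}

So a BLK_STS_RESOURCE return from ->queue_rq() only adds extra merge
opportunities on top of what insertion into mq-deadline already provides;
it is not what enables merging in the first place.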
>
> Thanks,
> Ming
>

-- 
Damien Le Moal
Western Digital Research