On 2020/02/13 16:54, Ming Lei wrote:
> On Thu, Feb 13, 2020 at 02:40:41AM +0000, Damien Le Moal wrote:
>> Ming,
>>
>> On 2020/02/13 7:03, Ming Lei wrote:
>>> On Wed, Feb 12, 2020 at 01:47:53AM +0000, Damien Le Moal wrote:
>>>> On 2020/02/12 4:01, Tim Walker wrote:
>>>>> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
>>>>>>> Background:
>>>>>>>
>>>>>>> The NVMe specification has hardened over the decade, and NVMe devices are now well integrated into our customers’ systems. As we look forward, moving HDDs to the NVMe command set eliminates the SAS IOC and driver stack, consolidating on a single access method for rotational and static storage technologies. PCIe-NVMe offers near-SATA interface costs, features and performance suitable for high-capacity HDDs, and optimal interoperability for storage automation, tiering, and management. We will share some early conceptual results and proposed salient design goals and challenges surrounding an NVMe HDD.
>>>>>>
>>>>>> HDD performance is very sensitive to IO order. Could you provide some background info about NVMe HDD? Such as:
>>>>>>
>>>>>> - number of hw queues
>>>>>> - hw queue depth
>>>>>> - will NVMe sort/merge IO among all SQs or not?
>>>>>>
>>>>>>>
>>>>>>> Discussion Proposal:
>>>>>>>
>>>>>>> We’d like to share our views and solicit input on:
>>>>>>>
>>>>>>> - What Linux storage stack assumptions do we need to be aware of as we develop these devices with drastically different performance characteristics than traditional NAND? For example, what scheduler or device driver level changes will be needed to integrate NVMe HDDs?
>>>>>>
>>>>>> IO merging is often important for HDDs. IO merging is usually triggered when .queue_rq() returns STS_RESOURCE, and so far this condition won't be triggered for NVMe SSDs.
>>>>>>
>>>>>> Also, blk-mq kills BDI queue congestion and ioc batching, and causes a writeback performance regression [1][2].
>>>>>>
>>>>>> What I am thinking is whether we need to switch to independent IO paths for handling SSD and HDD IO, given that the two mediums are so different from a performance viewpoint.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
>>>>>> [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
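To make the STS_RESOURCE point above concrete, here is a minimal, purely illustrative ->queue_rq() sketch, not taken from any real driver (struct hdd_dev, its fields, and hdd_submit_to_device() are invented names): when the device is saturated, the driver reports it instead of queueing blindly, blk-mq stops dequeueing from the scheduler queue, and mq-deadline then has something to sort and merge.

#include <linux/blk-mq.h>

/*
 * Hypothetical ->queue_rq() for an NVMe-HDD-style device. Returning
 * BLK_STS_DEV_RESOURCE (or BLK_STS_RESOURCE) when the device is full
 * keeps new requests in the scheduler queue, where they can be merged.
 */
static blk_status_t hdd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct hdd_dev *hdd = hctx->queue->queuedata;   /* invented device struct */

        /* Saturation feedback: leave the request in the scheduler queue. */
        if (atomic_read(&hdd->inflight) >= hdd->max_qd)
                return BLK_STS_DEV_RESOURCE;

        atomic_inc(&hdd->inflight);
        blk_mq_start_request(rq);

        if (hdd_submit_to_device(hdd, rq)) {            /* invented submission helper */
                atomic_dec(&hdd->inflight);
                return BLK_STS_IOERR;
        }

        return BLK_STS_OK;
}

nvme-pci essentially never takes the saturation path today, which is why nothing accumulates and nothing merges.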
>>>>>
>>>>> I would expect the drive would support a reasonable number of queues and a relatively deep queue depth, more in line with NVMe practices than SAS HDDs' typical 128. But it probably doesn't make sense to queue up thousands of commands on something as slow as an HDD, and many customers keep queue depths < 32 for latency management.
>>>>
>>>> Exposing an HDD through multiple queues, each with a high queue depth, is simply asking for trouble. Commands will end up spending so much time sitting in the queues that they will time out. This can already be observed with the smartpqi SAS HBA, which exposes single drives as multiqueue block devices with a high queue depth. Exercising these drives heavily leads to thousands of commands being queued and to timeouts. It is fairly easy to trigger this without a manual change to the QD. This has been on my to-do list of fixes for some time now (lacking the time to do it).
>>>
>>> Just wondering why smartpqi SAS won't set a proper .cmd_per_lun to avoid the issue; it looks like the driver simply assigns .can_queue to it, so it isn't strange to see the timeout issue. If .can_queue is a bit big, the HDD is easily saturated for too long.
>>>
>>>>
>>>> NVMe HDDs need to have an interface setup that matches their speed, that is, something like a SAS interface: *single* queue pair with a max QD of 256 or less, depending on what the drive can take. There is no TASK_SET_FULL notification in NVMe, so throttling has to come from the max QD of the SQ, which the drive will advertise to the host.
>>>>
>>>>> Merging and the elevator are important to HDD performance. I don't believe NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort within an SQ without driving large differences between the SSD & HDD data paths?
>>>>
>>>> As far as I know, there is no merging going on once requests are passed to the driver and added to an SQ, so this is beside the point. The current default block scheduler for NVMe SSDs is "none". This is decided based on the number of queues of the device. NVMe drives that have only a single queue *AND* the QUEUE_FLAG_NONROT flag cleared in their request queue can fall back to the default spinning-rust mq-deadline elevator. That will achieve the command merging and LBA ordering needed for good performance on HDDs.
>>>
>>> mq-deadline basically won't merge IO if STS_RESOURCE isn't returned from .queue_rq(), or if blk_mq_get_dispatch_budget() always returns true. NVMe's .queue_rq() basically always returns STS_OK.
>>
>> I am confused: when an elevator is set, ->queue_rq() is called for requests obtained from the elevator (with e->type->ops.dispatch_request()), after the requests have gone through it. And merging happens at that stage, when new requests are inserted in the elevator.
>
> When a request is queued to the LLD via .queue_rq(), it has already been removed from the scheduler queue. And IO merging is only done inside, or against, the scheduler queue.

Yes, for incoming new BIOs, not for requests passed to the LLD.

>> If ->queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, the request is indeed requeued, which offers more chances of further merging, but that is not the same as no merging happening at all. Am I missing your point here?
>
> BLK_STS_RESOURCE, BLK_STS_DEV_RESOURCE, or getting no budget can be thought of as device saturation feedback: more requests can then be gathered in the scheduler queue, since we don't dequeue requests from the scheduler queue when that happens, and so IO merging becomes possible.
>
> Without any device saturation feedback from the driver, the block layer just dequeues requests from the scheduler queue at the same speed they are submitted to hardware, so no IO can be merged.

Got it. And since a full queue means no more tags, submission will block on get_request() and there will be no chance for the elevator to merge anything (aside from opportunistic merging in plugs), isn't it? So I guess NVMe HDDs will need some tuning in this area.
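As a rough illustration of that kind of tuning (purely a sketch: hdd_mq_ops, struct hdd_cmd and the depth value are invented for illustration), the tag set for a rotational NVMe namespace could be constrained to something like this:

#include <linux/blk-mq.h>
#include <linux/numa.h>

/*
 * SAS-like setup: one SQ/CQ pair and a modest tag space, so running
 * out of tags throttles submission in place of TASK_SET_FULL.
 */
static struct blk_mq_tag_set hdd_tag_set = {
        .ops            = &hdd_mq_ops,                  /* invented blk_mq_ops */
        .nr_hw_queues   = 1,                            /* single queue pair */
        .queue_depth    = 128,                          /* closer to SAS HDD practice */
        .cmd_size       = sizeof(struct hdd_cmd),       /* invented per-command data */
        .numa_node      = NUMA_NO_NODE,
        .flags          = BLK_MQ_F_SHOULD_MERGE,
};

With nr_hw_queues == 1, the block layer will also pick mq-deadline by default rather than "none", which is what a spinning drive wants.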
> If you observe sequential IO on NVMe PCI, you will basically see no IO merging.
>
>
> Thanks,
> Ming
>

-- 
Damien Le Moal
Western Digital Research