On Tue, Feb 11, 2020 at 02:01:18PM -0500, Tim Walker wrote:
> On Tue, Feb 11, 2020 at 7:28 AM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 10, 2020 at 02:20:10PM -0500, Tim Walker wrote:
> > > Background:
> > >
> > > NVMe specification has hardened over the decade and now NVMe devices
> > > are well integrated into our customers’ systems. As we look forward,
> > > moving HDDs to the NVMe command set eliminates the SAS IOC and driver
> > > stack, consolidating on a single access method for rotational and
> > > static storage technologies. PCIe-NVMe offers near-SATA interface
> > > costs, features and performance suitable for high-cap HDDs, and
> > > optimal interoperability for storage automation, tiering, and
> > > management. We will share some early conceptual results and proposed
> > > salient design goals and challenges surrounding an NVMe HDD.
> >
> > HDD performance is very sensitive to IO order. Could you provide some
> > background info about NVMe HDD? Such as:
> >
> > - number of hw queues
> > - hw queue depth
> > - will NVMe sort/merge IO among all SQs or not?
> >
> > >
> > > Discussion Proposal:
> > >
> > > We’d like to share our views and solicit input on:
> > >
> > > - What Linux storage stack assumptions do we need to be aware of as we
> > >   develop these devices with drastically different performance
> > >   characteristics than traditional NAND? For example, what scheduler or
> > >   device driver level changes will be needed to integrate NVMe HDDs?
> >
> > IO merge is often important for HDD. IO merge is usually triggered when
> > .queue_rq() returns STS_RESOURCE, and so far this condition isn't
> > triggered for NVMe SSDs.
> >
> > Also blk-mq kills BDI queue congestion and ioc batching, and causes
> > writeback performance regression[1][2].
> >
> > What I am thinking about is whether we need to switch to an independent
> > IO path for handling SSD and HDD IO, given the two mediums are so
> > different from a performance viewpoint.
> >
> > [1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
> > [2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/
> >
> >
> > Thanks,
> > Ming
> >
>
> I would expect the drive would support a reasonable number of queues
> and a relatively deep queue depth, more in line with NVMe practices
> than SAS HDD's typical 128. But it probably doesn't make sense to
> queue up thousands of commands on something as slow as an HDD, and
> many customers keep queues < 32 for latency management.

MQ & a deep queue depth will cause trouble for HDDs; as Damien mentioned,
IO timeouts may result. Then it looks like you need to add a per-ns queue
depth, just like what sdev->device_busy does, to avoid IO timeouts.

On the other hand, with a per-ns queue depth, you can prevent IO from being
submitted to the device when the ns is saturated, and then the block
layer's IO merge can be triggered.
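To make that concrete, here is a rough sketch (not actual nvme driver
code; struct my_ns, its fields and the function names are all made up for
illustration) of how a ->queue_rq() could push back once a namespace is
saturated, so blk-mq holds requests back and gets a chance to merge:

#include <linux/blk-mq.h>
#include <linux/atomic.h>

/* Hypothetical per-namespace state, similar in spirit to sdev->device_busy. */
struct my_ns {
	atomic_t	device_busy;	/* commands in flight on this ns */
	int		queue_depth;	/* per-ns limit, e.g. 32 for an HDD */
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct my_ns *ns = hctx->queue->queuedata;

	/*
	 * If the ns is already saturated, return BLK_STS_DEV_RESOURCE.
	 * blk-mq keeps the request on the dispatch list and retries later,
	 * which gives the block layer a window to merge more IO.
	 */
	if (!atomic_add_unless(&ns->device_busy, 1, ns->queue_depth))
		return BLK_STS_DEV_RESOURCE;

	/* ... map and submit bd->rq to the device here ... */

	return BLK_STS_OK;
}

/* On completion, drop the per-ns count and kick the queue again. */
static void my_complete_rq(struct request *rq)
{
	struct my_ns *ns = rq->q->queuedata;

	atomic_dec(&ns->device_busy);
	blk_mq_run_hw_queues(rq->q, true);
}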
> Merge and elevator are important to HDD performance. I don't believe
> NVMe should attempt to merge/sort across SQs. Can NVMe merge/sort
> within a SQ without driving large differences between SSD & HDD data
> paths?

If NVMe doesn't sort/merge across SQs, it would be better to just use a
single queue for HDD; otherwise it is easy to break IO order & merging.

Someone even complained that sequential IO becomes discontinuous on
NVMe (SSD) when the arbitration burst is less than the IO queue depth.
It was said that fio performance was hurt, but I don't understand how
that can happen on an SSD.

Thanks,
Ming
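P.S. For what "just use a single queue for HDD" could look like, a rough
sketch follows (not the real nvme tag set setup; my_mq_ops,
my_init_tag_set and dev_is_rotational are made up) of capping
nr_hw_queues for a rotational device so ordering and merging are
preserved device-wide:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/numa.h>
#include <linux/string.h>

/* The driver's blk_mq_ops, e.g. with .queue_rq as sketched above. */
extern const struct blk_mq_ops my_mq_ops;

static int my_init_tag_set(struct blk_mq_tag_set *set, bool dev_is_rotational)
{
	memset(set, 0, sizeof(*set));
	set->ops	  = &my_mq_ops;
	/* One hw queue keeps submission order and lets the block layer merge. */
	set->nr_hw_queues = dev_is_rotational ? 1 : num_possible_cpus();
	/* Keep the depth modest for an HDD, deep for an SSD. */
	set->queue_depth  = dev_is_rotational ? 32 : 1024;
	set->numa_node	  = NUMA_NO_NODE;
	set->flags	  = BLK_MQ_F_SHOULD_MERGE;

	return blk_mq_alloc_tag_set(set);
}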