Hi, Any comment on this series? Thanks, Jeffle On 2/8/21 4:52 PM, Jeffle Xu wrote: > > [Performance] > 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear > device, which is built upon 3 nvme devices, with one polling hw > queue per nvme device. > > | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff > ---- | --------------- | -------------------- | ---- > dm | 208k | 279k | ~34% > > > 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear > device, which is built upon 3 nvme devices, with one polling hw > queue per nvme device. > > It's compared to 3 threads directly randread 3 nvme devices, with one > polling hw queue per nvme device. No CPU affinity set for these 3 > threads. Thus every thread can access every nvme device > (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread > need to competing for every polling hw queue. > > | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff > ---- | --------------- | -------------------- | ---- > dm | 615k | 728k | ~18% > nvme | 728k | 873k | ~20% > > The result shows that the performance gain of bio-based polling is > comparable as that of mq polling in the same test case. > > > 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one > **dm-stripe** device, which is built upon 3 nvme devices, with one > polling hw queue per nvme device. > > It's compared to 3 threads directly randread 3 nvme devices, with one > polling hw queue per nvme device. No CPU affinity set for these 3 > threads. Thus every thread can access every nvme device > (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread > need to competing for every polling hw queue. > > | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff > ---- | --------------- | -------------------- | ---- > dm | 314k | 354k | ~13% > nvme | 728k | 873k | ~20% > > > 4. This patchset shall do no harm to the performance of the original mq > polling. Following is the test results of one thread (numjobs=1) > randread (bs=4k, direct=1) one nvme device. > > | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff > ---------------- | --------------- | -------------------- | ---- > without patchset | 242k | 332k | ~39% > with patchset | 236k | 332k | ~39% > > > > [Changes since v2] > > Patchset v2 caches all hw queues (in polling mode) of underlying mq > devices in dm layer. The polling routine actually iterates through all > these cached hw queues. > > However, mq may change the queue mapping at runtime (e.g., NVMe RESET > command), thus the cached hw queues in dm layer may be out-of-date. Thus > patchset v3 falls back to the implementation of the very first RFC > version, in which the mq layer needs to export one interface iterating > all polling hw queues (patch 5), and the bio-based polling routine just > calls this interface to iterate all polling hw queues. > > Besides, several new optimization is proposed. > > > - patch 1,2,7 > same as v2, untouched > > - patch 3 > Considering advice from Christoph Hellwig, while refactoring blk_poll(), > split mq and bio-based polling routine from the very beginning. Now > blk_poll() is just a simple entry. blk_bio_poll() is simply copied from > blk_mq_poll(), while the loop structure is some sort of duplication > though. > > - patch 4 > This patch is newly added to support turning on/off polling through > '/sys/block/<dev>/queue/io_poll' dynamiclly for bio-based devices. > Patchset v2 implemented this functionality by added one new queue flag, > which is not preferred since the queue flag resource is quite short of > nowadays. > > - patch 5 > This patch is newly added, preparing for the following bio-based > polling. The following bio-based polling will call this helper function, > accounting on the corresponding hw queue. > > - patch 6 > It's from the very first RFC version, preparing for the following > bio-based polling. > > - patch 8 > One fixing patch needed by the following bio-based polling. It's > actually a v2 of [1]. I had sent the v2 singly in-reply-to [1], though > it has not been visible on the mailing list maybe due to the delay. > > - patch 9 > It's from the very first RFC version. > > - patch 10 > This patch is newly added. Patchset v2 had ever proposed one > optimization that, skipping the **busy** hw queues during the iteration > phase. Back upon that time, one flag of 'atomic_t' is specifically > maintained in dm layer, representing if the corresponding hw queue is > busy or not. The idea is inherited, while the implementation changes. > Now @nvmeq->cq_poll_lock is used directly here, no need for extra flag > anymore. > > This optimization can significantly reduce the competition for one hw > queue between multiple polling instances. Following statistics is the > test result when 3 threads concurrently randread (bs=4k, direct=1) one > dm-linear device, which is built upon 3 nvme devices, with one polling > hw queue per nvme device. > > | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff > ----------- | --------------- | -------------------- | ---- > without opt | 318k | 256k | ~-20% > with opt | 314k | 354k | ~13% > > > - patch 11 > This is another newly added optimizatin for bio-based polling. > > One intuitive insight is that, when the original bio submitted to dm > device doesn't get split, then the bio gets enqueued into only one hw > queue of one of the underlying mq devices. In this case, we no longer > need to track all split bios, and one cookie (for the only split bio) > is enough. It is implemented by returning the pointer to the > corresponding hw queue in this case. > > It should be safe by directly returning the pointer to the hw queue, > since 'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of > 'struct request_queue'. Even when the number of hw queues may decrease > when NVMe RESET happens, the 'struct request_queue' structure of decreased > hw queues won't be freed, instead it's buffered into > &q->unused_hctx_list list. > > Though this optimization seems quite intuitive, the performance test > shows that it does no benefit nor harm to the performance, while 3 > threads concurrently randreading (bs=4k, direct=1) one dm-linear > device, which is built upon 3 nvme devices, with one polling hw queue > per nvme device. > > I'm not sure why it doesn't work, maybe because the number of devices, > or the depth of the devcice stack is to low in my test case? > > > [Remained Issue] > It has been mentioned in patch 4 that, users could change the state of > the underlying devices through '/sys/block/<dev>/io_poll', bypassing > the dm device above. Thus it can cause a situation where QUEUE_FLAG_POLL > is still set for the request_queue of dm device, while one of the > underlying mq device may has cleared this flag. > > In this case, it will pass the 'test_bit(QUEUE_FLAG_POLL, &q->queue_flags)' > check in blk_poll(), while the input cookie may actually points to a hw > queue in IRQ mode since patch 11. Thus for this hw queue (in IRQ mode), > the bio-based polling routine will handle this hw queue acquiring > 'spin_lock(&nvmeq->cq_poll_lock)' (refer > drivers/nvme/host/pci.c:nvme_poll), which is not adequate since this hw > queue may also be accessed in IRQ context. In other words, > spin_lock_irq() should be used here. > > I have not come up one simple way to fix it. I don't want to do sanity > check (e.g., the type of the hw queue is HCTX_TYPE_POLL or not) in the > IO path (submit_bio()/blk_poll()), i.e., fast path. > > We'd better fix it in the control path, i.e., dm could be aware of the > change when attribute (e.g., support io_poll or not) of one of the > underlying devices changed at runtime. > > [1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00028.html > > > changes since v1: > - patch 1,2,4 is the same as v1 and have already been reviewed > - patch 3 is refactored a bit on the basis of suggestions from > Mike Snitzer. > - patch 5 is newly added and introduces one new queue flag > representing if the queue is capable of IO polling. This mainly > simplifies the logic in queue_poll_store(). > - patch 6 implements the core mechanism supporting IO polling. > The sanity check checking if the dm device supports IO polling is > also folded into this patch, and the queue flag will be cleared if > it doesn't support, in case of table reloading. > > > > > Jeffle Xu (11): > block: move definition of blk_qc_t to types.h > block: add queue_to_disk() to get gendisk from request_queue > block: add poll method to support bio-based IO polling > block: add poll_capable method to support bio-based IO polling > block/mq: extract one helper function polling hw queue > block/mq: add iterator for polling hw queues > dm: always return BLK_QC_T_NONE for bio-based device > dm: fix iterate_device sanity check > dm: support IO polling for bio-based dm device > nvme/pci: don't wait for locked polling queue > dm: fastpath of bio-based polling > > block/blk-core.c | 89 +++++++++++++- > block/blk-mq.c | 27 +---- > block/blk-sysfs.c | 14 ++- > drivers/md/dm-table.c | 215 ++++++++++++++++++---------------- > drivers/md/dm.c | 90 +++++++++++--- > drivers/md/dm.h | 2 +- > drivers/nvme/host/pci.c | 4 +- > include/linux/blk-mq.h | 21 ++++ > include/linux/blk_types.h | 10 +- > include/linux/blkdev.h | 4 + > include/linux/device-mapper.h | 1 + > include/linux/fs.h | 2 +- > include/linux/types.h | 3 + > include/trace/events/kyber.h | 6 +- > 14 files changed, 333 insertions(+), 155 deletions(-) > -- Thanks, Jeffle