Re: [PATCH v3 00/11] dm: support IO polling

JeffleXu <jefflexu@xxxxxxxxxxxxxxxxx> · Wed, 17 Feb 2021 21:15:32 +0800

Hi,

Any comment on this series?

Thanks,
Jeffle

On 2/8/21 4:52 PM, Jeffle Xu wrote:
> 
> [Performance]
> 1. One thread (numjobs=1) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
> 
>      | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
>   dm | 		  208k |		 279k | ~34%
> 
> 
> 2. Three threads (numjobs=3) randread (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw
> queue per nvme device.
> 
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
> 
>      | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
>   dm | 		  615k |		 728k | ~18%
> nvme | 		  728k |		 873k | ~20%
> 
> The result shows that the performance gain of bio-based polling is
> comparable as that of mq polling in the same test case.
> 
> 
> 3. Three threads (numjobs=3) randread (bs=12k, direct=1) one
> **dm-stripe** device, which is built upon 3 nvme devices, with one
> polling hw queue per nvme device.
> 
> It's compared to 3 threads directly randread 3 nvme devices, with one
> polling hw queue per nvme device. No CPU affinity set for these 3
> threads. Thus every thread can access every nvme device
> (filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1), i.e., every thread
> need to competing for every polling hw queue.
> 
>      | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---- | --------------- | -------------------- | ----
>   dm | 		  314k |		 354k | ~13%
> nvme | 		  728k |		 873k | ~20%
> 
> 
> 4. This patchset shall do no harm to the performance of the original mq
> polling. Following is the test results of one thread (numjobs=1)
> randread (bs=4k, direct=1) one nvme device.
> 
> 	    	 | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ---------------- | --------------- | -------------------- | ----
> without patchset | 	      242k |		     332k | ~39%
> with patchset    |	      236k |		     332k | ~39%
> 
> 
> 
> [Changes since v2]
> 
> Patchset v2 caches all hw queues (in polling mode) of underlying mq
> devices in dm layer. The polling routine actually iterates through all
> these cached hw queues.
> 
> However, mq may change the queue mapping at runtime (e.g., NVMe RESET
> command), thus the cached hw queues in dm layer may be out-of-date. Thus
> patchset v3 falls back to the implementation of the very first RFC
> version, in which the mq layer needs to export one interface iterating
> all polling hw queues (patch 5), and the bio-based polling routine just
> calls this interface to iterate all polling hw queues.
> 
> Besides, several new optimization is proposed.
> 
> 
> - patch 1,2,7
> same as v2, untouched
> 
> - patch 3
> Considering advice from Christoph Hellwig, while refactoring blk_poll(),
> split mq and bio-based polling routine from the very beginning. Now
> blk_poll() is just a simple entry. blk_bio_poll() is simply copied from
> blk_mq_poll(), while the loop structure is some sort of duplication
> though.
> 
> - patch 4
> This patch is newly added to support turning on/off polling through
> '/sys/block/<dev>/queue/io_poll' dynamiclly for bio-based devices.
> Patchset v2 implemented this functionality by added one new queue flag,
> which is not preferred since the queue flag resource is quite short of
> nowadays.
> 
> - patch 5
> This patch is newly added, preparing for the following bio-based
> polling. The following bio-based polling will call this helper function,
> accounting on the corresponding hw queue.
> 
> - patch 6
> It's from the very first RFC version, preparing for the following
> bio-based polling.
> 
> - patch 8
> One fixing patch needed by the following bio-based polling. It's
> actually a v2 of [1]. I had sent the v2 singly in-reply-to [1], though
> it has not been visible on the mailing list maybe due to the delay.
> 
> - patch 9
> It's from the very first RFC version.
> 
> - patch 10
> This patch is newly added. Patchset v2 had ever proposed one
> optimization that, skipping the **busy** hw queues during the iteration
> phase. Back upon that time, one flag of 'atomic_t' is specifically
> maintained in dm layer, representing if the corresponding hw queue is
> busy or not. The idea is inherited, while the implementation changes.
> Now @nvmeq->cq_poll_lock is used directly here, no need for extra flag
> anymore.
> 
> This optimization can significantly reduce the competition for one hw
> queue between multiple polling instances. Following statistics is the
> test result when 3 threads concurrently randread (bs=4k, direct=1) one
> dm-linear device, which is built upon 3 nvme devices, with one polling
> hw queue per nvme device.
> 
> 	    | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
> ----------- | --------------- | -------------------- | ----
> without opt | 		 318k |		 	256k | ~-20%
> with opt    |		 314k |		 	354k | ~13%
> 							
> 
> - patch 11
> This is another newly added optimizatin for bio-based polling.
> 
> One intuitive insight is that, when the original bio submitted to dm
> device doesn't get split, then the bio gets enqueued into only one hw
> queue of one of the underlying mq devices. In this case, we no longer
> need to track all split bios, and one cookie (for the only split bio)
> is enough. It is implemented by returning the pointer to the
> corresponding hw queue in this case.
> 
> It should be safe by directly returning the pointer to the hw queue,
> since 'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of
> 'struct request_queue'. Even when the number of hw queues may decrease
> when NVMe RESET happens, the 'struct request_queue' structure of decreased
> hw queues won't be freed, instead it's buffered into
> &q->unused_hctx_list list.
> 
> Though this optimization seems quite intuitive, the performance test
> shows that it does no benefit nor harm to the performance, while 3
> threads concurrently randreading (bs=4k, direct=1) one dm-linear
> device, which is built upon 3 nvme devices, with one polling hw queue
> per nvme device.
> 
> I'm not sure why it doesn't work, maybe because the number of devices,
> or the depth of the devcice stack is to low in my test case?
> 
> 
> [Remained Issue]
> It has been mentioned in patch 4 that, users could change the state of
> the underlying devices through '/sys/block/<dev>/io_poll', bypassing
> the dm device above. Thus it can cause a situation where QUEUE_FLAG_POLL
> is still set for the request_queue of dm device, while one of the
> underlying mq device may has cleared this flag.
> 
> In this case, it will pass the 'test_bit(QUEUE_FLAG_POLL, &q->queue_flags)'
> check in blk_poll(), while the input cookie may actually points to a hw
> queue in IRQ mode since patch 11. Thus for this hw queue (in IRQ mode),
> the bio-based polling routine will handle this hw queue acquiring
> 'spin_lock(&nvmeq->cq_poll_lock)' (refer
> drivers/nvme/host/pci.c:nvme_poll), which is not adequate since this hw
> queue may also be accessed in IRQ context. In other words,
> spin_lock_irq() should be used here.
> 
> I have not come up one simple way to fix it. I don't want to do sanity
> check (e.g., the type of the hw queue is HCTX_TYPE_POLL or not) in the
> IO path (submit_bio()/blk_poll()), i.e., fast path.
> 
> We'd better fix it in the control path, i.e., dm could be aware of the
> change when attribute (e.g., support io_poll or not) of one of the
> underlying devices changed at runtime.
> 
> [1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00028.html
> 
> 
> changes since v1:
> - patch 1,2,4 is the same as v1 and have already been reviewed
> - patch 3 is refactored a bit on the basis of suggestions from
> Mike Snitzer.
> - patch 5 is newly added and introduces one new queue flag
> representing if the queue is capable of IO polling. This mainly
> simplifies the logic in queue_poll_store().
> - patch 6 implements the core mechanism supporting IO polling.
> The sanity check checking if the dm device supports IO polling is
> also folded into this patch, and the queue flag will be cleared if
> it doesn't support, in case of table reloading.
> 
> 
> 
> 
> Jeffle Xu (11):
>   block: move definition of blk_qc_t to types.h
>   block: add queue_to_disk() to get gendisk from request_queue
>   block: add poll method to support bio-based IO polling
>   block: add poll_capable method to support bio-based IO polling
>   block/mq: extract one helper function polling hw queue
>   block/mq: add iterator for polling hw queues
>   dm: always return BLK_QC_T_NONE for bio-based device
>   dm: fix iterate_device sanity check
>   dm: support IO polling for bio-based dm device
>   nvme/pci: don't wait for locked polling queue
>   dm: fastpath of bio-based polling
> 
>  block/blk-core.c              |  89 +++++++++++++-
>  block/blk-mq.c                |  27 +----
>  block/blk-sysfs.c             |  14 ++-
>  drivers/md/dm-table.c         | 215 ++++++++++++++++++----------------
>  drivers/md/dm.c               |  90 +++++++++++---
>  drivers/md/dm.h               |   2 +-
>  drivers/nvme/host/pci.c       |   4 +-
>  include/linux/blk-mq.h        |  21 ++++
>  include/linux/blk_types.h     |  10 +-
>  include/linux/blkdev.h        |   4 +
>  include/linux/device-mapper.h |   1 +
>  include/linux/fs.h            |   2 +-
>  include/linux/types.h         |   3 +
>  include/trace/events/kyber.h  |   6 +-
>  14 files changed, 333 insertions(+), 155 deletions(-)
> 

-- 
Thanks,
Jeffle