On 5/16/24 8:01 AM, Anuj gupta wrote:
> On Tue, May 14, 2024 at 1:25 PM Chenliang Li <cliang01.li@xxxxxxxxxxx> wrote:
>>
>> Registered buffers are stored and processed in the form of a bvec array;
>> each bvec element typically points to a PAGE_SIZE page, but can also
>> cover a hugepage. Specifically, a buffer consisting of a single hugepage
>> is coalesced into one hugepage bvec entry during registration. This
>> coalescing saves both space and DMA-mapping time.
>>
>> However, the coalescing currently doesn't work for multi-hugepage
>> buffers. For a buffer made up of several 2M hugepages, we still split it
>> into thousands of 4K page bvec entries when, in fact, a handful of
>> hugepage bvecs would do.
>>
>> This patch series enables coalescing of registered buffers spanning more
>> than one hugepage. It reduces DMA-mapping time and saves memory for such
>> buffers.
>>
>> Testing:
>>
>> Hugepage fixed-buffer I/O can be tested with unmodified fio; the fio
>> command used in the tests below is given in [1]. There is also a liburing
>> testcase in [2]. The system should have enough hugepages available before
>> testing.
>>
>> Perf diff of the 8M (4 * 2M hugepages) fio randread test:
>>
>> Before     After      Symbol
>> .....................................................
>> 4.68%                 [k] __blk_rq_map_sg
>> 3.31%                 [k] dma_direct_map_sg
>> 2.64%                 [k] dma_pool_alloc
>> 1.09%                 [k] sg_next
>>            +0.49%     [k] dma_map_page_attrs
>>
>> Perf diff of the 8M fio randwrite test:
>>
>> Before     After      Symbol
>> ......................................................
>> 2.82%                 [k] __blk_rq_map_sg
>> 2.05%                 [k] dma_direct_map_sg
>> 1.75%                 [k] dma_pool_alloc
>> 0.68%                 [k] sg_next
>>            +0.08%     [k] dma_map_page_attrs
>>
>> The first three patches prepare for adding multi-hugepage coalescing to
>> buffer registration; the fourth patch enables the feature.
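To put the bvec numbers above in concrete terms, here is a minimal standalone
C sketch (not part of the series; the 8 MiB buffer and 2 MiB hugepage sizes
simply mirror the fio test in [1]) that prints how many bvec entries the
buffer needs when split into 4 KiB pages versus coalesced into hugepage-sized
entries:

/*
 * Standalone illustration, not code from the series: for an 8 MiB buffer
 * (4 x 2 MiB hugepages, matching the fio test in [1]), count how many bvec
 * entries are needed with 4 KiB pages versus coalesced hugepage entries.
 */
#include <stdio.h>

int main(void)
{
	unsigned long buf_size  = 8UL << 20;	/* 8 MiB registered buffer */
	unsigned long page_size = 4UL << 10;	/* base page size */
	unsigned long huge_size = 2UL << 20;	/* hugepage (folio) size */

	printf("per-page bvecs:  %lu\n", buf_size / page_size);	/* 2048 */
	printf("coalesced bvecs: %lu\n", buf_size / huge_size);	/*    4 */
	return 0;
}

It prints 2048 versus 4; that reduction in bvec entries is where the
DMA-mapping savings shown above come from.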
>>
>> -----------------
>> Changes since v3:
>>
>> - Delete an unnecessary commit message
>> - Update the test command and test results
>>
>> v3: https://lore.kernel.org/io-uring/20240514001614.566276-1-cliang01.li@xxxxxxxxxxx/T/#t
>>
>> Changes since v2:
>>
>> - Modify the loop iterator increment to make the code cleaner
>> - Minor fix to the return path in coalesced buffer accounting
>> - Correct commit messages
>> - Add test cases in liburing
>>
>> v2: https://lore.kernel.org/io-uring/20240513020149.492727-1-cliang01.li@xxxxxxxxxxx/T/#t
>>
>> Changes since v1:
>>
>> - Split into 4 patches
>> - Fix code style issues
>> - Rearrange the code changes for a cleaner look
>> - Add a specialized pinned-page accounting procedure for coalesced
>>   buffers
>> - Reorder the newly added fields in the imu struct for better packing
>>
>> v1: https://lore.kernel.org/io-uring/20240506075303.25630-1-cliang01.li@xxxxxxxxxxx/T/#u
>>
>> [1]
>> fio -iodepth=64 -rw=randread(-rw=randwrite) -direct=1 -ioengine=io_uring \
>>     -bs=8M -numjobs=1 -group_reporting -mem=shmhuge -fixedbufs -hugepage-size=2M \
>>     -filename=/dev/nvme0n1 -runtime=10s -name=test1
>>
>> [2]
>> https://lore.kernel.org/io-uring/20240514051343.582556-1-cliang01.li@xxxxxxxxxxx/T/#u
>>
>> Chenliang Li (4):
>>   io_uring/rsrc: add hugepage buffer coalesce helpers
>>   io_uring/rsrc: store folio shift and mask into imu
>>   io_uring/rsrc: add init and account functions for coalesced imus
>>   io_uring/rsrc: enable multi-hugepage buffer coalescing
>>
>>  io_uring/rsrc.c | 217 +++++++++++++++++++++++++++++++++++++++---------
>>  io_uring/rsrc.h |  12 +++
>>  2 files changed, 191 insertions(+), 38 deletions(-)
>>
>>
>> base-commit: 59b28a6e37e650c0d601ed87875b6217140cda5d
>> --
>> 2.34.1
>>
>>
>
> I tested this series by registering multi-hugepage buffers. The coalescing
> helps save DMA-mapping time. This is the gain observed on my setup while
> running the fio workload shared here.
>
> RandomRead:
> Baseline   DeltaAbs   Symbol
> .....................................................
> 3.89%      -3.62%     [k] blk_rq_map_sg
> 3.58%      -3.23%     [k] dma_direct_map_sg
> 2.25%      -2.23%     [k] sg_next
>
> RandomWrite:
> Baseline   DeltaAbs   Symbol
> .....................................................
> 2.46%      -2.31%     [k] dma_direct_map_sg
> 2.06%      -2.05%     [k] sg_next
> 2.08%      -1.80%     [k] blk_rq_map_sg
>
> The liburing test case shared works fine too on my setup.
>
> Feel free to add:
> Tested-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx>

It's even more dramatic here; an excerpt from the profiles:

32.16%  -25.46%  [kernel.kallsyms]  [k] bio_split_rw
 8.92%   -8.38%  [kernel.kallsyms]  [k] iov_iter_is_aligned
 6.85%   -4.31%  [nvme]             [k] nvme_prep_rq.part.0
        14.71%   [kernel.kallsyms]  [k] __blk_rq_map_sg
         9.49%   [kernel.kallsyms]  [k] dma_direct_map_sg
         8.50%   [kernel.kallsyms]  [k] sg_next

Some of it just shifted, but it's definitely a huge win. This is just using a
single drive, doing about 7 GB/sec.

The change looks pretty reasonable to me. I'd love for the test cases to try
and hit corner cases, as it's really more of a functionality test right now.
We should include things like one-off hugepages, ensuring we don't coalesce
where we should not, etc.

This is obviously too late for the 6.10 merge window, so there's plenty of
time to get this 100% sorted before the next kernel release.

-- 
Jens Axboe
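For reference, the functionality-level exercise discussed above can be
sketched with plain liburing as below. This is a hypothetical minimal example,
not the testcase from [2]: it registers an 8 MiB fixed buffer backed by 2 MiB
hugepages, which is the path the coalescing targets. The corner cases Jens
mentions (one-off hugepages, buffers that must not be coalesced) would need
dedicated cases on top of something like this.

/*
 * Hypothetical minimal example (not the testcase from [2]): register an
 * 8 MiB fixed buffer backed by 2 MiB hugepages. Requires reserved
 * hugepages, e.g. echo 8 > /proc/sys/vm/nr_hugepages.
 */
#define _GNU_SOURCE
#include <liburing.h>
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE	(8UL * 1024 * 1024)	/* 4 x 2 MiB hugepages */

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	void *buf;
	int ret;

	/* Anonymous hugepage-backed mapping; fails if no hugepages are reserved. */
	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;

	/*
	 * With the series applied, this registration should end up with a
	 * handful of hugepage-sized bvecs instead of 2048 4 KiB entries.
	 */
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret) {
		fprintf(stderr, "register_buffers: %d\n", ret);
		return 1;
	}

	io_uring_unregister_buffers(&ring);
	io_uring_queue_exit(&ring);
	munmap(buf, BUF_SIZE);
	return 0;
}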