On 5/16/24 8:01 AM, Anuj gupta wrote:
> On Tue, May 14, 2024 at 1:25 PM Chenliang Li <cliang01.li@xxxxxxxxxxx> wrote:
>>
>> Registered buffers are stored and processed in the form of a bvec array;
>> each bvec element typically points to a PAGE_SIZE page, but can also
>> cover a hugepage. Specifically, a buffer consisting of a single hugepage
>> is coalesced into one hugepage bvec entry during registration. This
>> coalescing saves both space and DMA-mapping time.
>>
>> However, the coalescing currently doesn't work for multi-hugepage
>> buffers. For a buffer made up of several 2M hugepages, we still split it
>> into thousands of 4K page bvec entries when, in fact, a handful of
>> hugepage bvecs would do.
>>
>> This patch series enables coalescing of registered buffers spanning more
>> than one hugepage. It reduces DMA-mapping time and saves memory for such
>> buffers.
>>
>> Testing:
>>
>> Hugepage fixed-buffer I/O can be tested with unmodified fio; the fio
>> command used in the tests below is given in [1]. There is also a liburing
>> testcase in [2]. The system should have enough hugepages available before
>> testing.
>>
>> Perf diff of the 8M (4 * 2M hugepages) fio randread test:
>>
>> Before     After      Symbol
>> .....................................................
>> 4.68%                 [k] __blk_rq_map_sg
>> 3.31%                 [k] dma_direct_map_sg
>> 2.64%                 [k] dma_pool_alloc
>> 1.09%                 [k] sg_next
>>            +0.49%     [k] dma_map_page_attrs
>>
>> Perf diff of the 8M fio randwrite test:
>>
>> Before     After      Symbol
>> ......................................................
>> 2.82%                 [k] __blk_rq_map_sg
>> 2.05%                 [k] dma_direct_map_sg
>> 1.75%                 [k] dma_pool_alloc
>> 0.68%                 [k] sg_next
>>            +0.08%     [k] dma_map_page_attrs
>>
>> The first three patches prepare for adding multi-hugepage coalescing to
>> buffer registration; the fourth patch enables the feature.
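To put the bvec numbers above in concrete terms, here is a minimal standalone
C sketch (not part of the series; the 8 MiB buffer and 2 MiB hugepage sizes
simply mirror the fio test in [1]) that prints how many bvec entries the
buffer needs when split into 4 KiB pages versus coalesced into hugepage-sized
entries:

/*
 * Standalone illustration, not code from the series: for an 8 MiB buffer
 * (4 x 2 MiB hugepages, matching the fio test in [1]), count how many bvec
 * entries are needed with 4 KiB pages versus coalesced hugepage entries.
 */
#include <stdio.h>

int main(void)
{
	unsigned long buf_size  = 8UL << 20;	/* 8 MiB registered buffer */
	unsigned long page_size = 4UL << 10;	/* base page size */
	unsigned long huge_size = 2UL << 20;	/* hugepage (folio) size */

	printf("per-page bvecs:  %lu\n", buf_size / page_size);	/* 2048 */
	printf("coalesced bvecs: %lu\n", buf_size / huge_size);	/*    4 */
	return 0;
}

It prints 2048 versus 4; that reduction in bvec entries is where the
DMA-mapping savings shown above come from.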
>>
>> -----------------
>> Changes since v3:
>>
>> - Delete an unnecessary commit message
>> - Update the test command and test results
>>
>> v3: https://lore.kernel.org/io-uring/20240514001614.566276-1-cliang01.li@xxxxxxxxxxx/T/#t
>>
>> Changes since v2:
>>
>> - Modify the loop iterator increment to make the code cleaner
>> - Minor fix to the return path in coalesced buffer accounting
>> - Correct commit messages
>> - Add test cases in liburing
>>
>> v2: https://lore.kernel.org/io-uring/20240513020149.492727-1-cliang01.li@xxxxxxxxxxx/T/#t
>>
>> Changes since v1:
>>
>> - Split into 4 patches
>> - Fix code style issues
>> - Rearrange the code changes for a cleaner look
>> - Add a specialized pinned-page accounting procedure for coalesced
>>   buffers
>> - Reorder the newly added fields in the imu struct for better packing
>>
>> v1: https://lore.kernel.org/io-uring/20240506075303.25630-1-cliang01.li@xxxxxxxxxxx/T/#u
>>
>> [1]
>> fio -iodepth=64 -rw=randread(-rw=randwrite) -direct=1 -ioengine=io_uring \
>>     -bs=8M -numjobs=1 -group_reporting -mem=shmhuge -fixedbufs -hugepage-size=2M \
>>     -filename=/dev/nvme0n1 -runtime=10s -name=test1
>>
>> [2]
>> https://lore.kernel.org/io-uring/20240514051343.582556-1-cliang01.li@xxxxxxxxxxx/T/#u
>>
>> Chenliang Li (4):
>>   io_uring/rsrc: add hugepage buffer coalesce helpers
>>   io_uring/rsrc: store folio shift and mask into imu
>>   io_uring/rsrc: add init and account functions for coalesced imus
>>   io_uring/rsrc: enable multi-hugepage buffer coalescing
>>
>>  io_uring/rsrc.c | 217 +++++++++++++++++++++++++++++++++++++++---------
>>  io_uring/rsrc.h |  12 +++
>>  2 files changed, 191 insertions(+), 38 deletions(-)
>>
>>
>> base-commit: 59b28a6e37e650c0d601ed87875b6217140cda5d
>> --
>> 2.34.1
>>
>>
>
> I tested this series by registering multi-hugepage buffers. The coalescing
> helps save DMA-mapping time. This is the gain observed on my setup while
> running the fio workload shared here.
>
> RandomRead:
> Baseline   DeltaAbs   Symbol
> .....................................................
> 3.89%      -3.62%     [k] blk_rq_map_sg
> 3.58%      -3.23%     [k] dma_direct_map_sg
> 2.25%      -2.23%     [k] sg_next
>
> RandomWrite:
> Baseline   DeltaAbs   Symbol
> .....................................................
> 2.46%      -2.31%     [k] dma_direct_map_sg
> 2.06%      -2.05%     [k] sg_next
> 2.08%      -1.80%     [k] blk_rq_map_sg
>
> The liburing test case shared works fine too on my setup.
>
> Feel free to add:
> Tested-by: Anuj Gupta <anuj20.g@xxxxxxxxxxx>

It's even more dramatic here; an excerpt from the profiles:

32.16%  -25.46%  [kernel.kallsyms]  [k] bio_split_rw
 8.92%   -8.38%  [kernel.kallsyms]  [k] iov_iter_is_aligned
 6.85%   -4.31%  [nvme]             [k] nvme_prep_rq.part.0
        14.71%   [kernel.kallsyms]  [k] __blk_rq_map_sg
         9.49%   [kernel.kallsyms]  [k] dma_direct_map_sg
         8.50%   [kernel.kallsyms]  [k] sg_next

Some of it just shifted, but it's definitely a huge win. This is just using a
single drive, doing about 7 GB/sec.

The change looks pretty reasonable to me. I'd love for the test cases to try
and hit corner cases, as it's really more of a functionality test right now.
We should include things like one-off hugepages, ensuring we don't coalesce
where we should not, etc.

This is obviously too late for the 6.10 merge window, so there's plenty of
time to get this 100% sorted before the next kernel release.

-- 
Jens Axboe
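For reference, the functionality-level exercise discussed above can be
sketched with plain liburing as below. This is a hypothetical minimal example,
not the testcase from [2]: it registers an 8 MiB fixed buffer backed by 2 MiB
hugepages, which is the path the coalescing targets. The corner cases Jens
mentions (one-off hugepages, buffers that must not be coalesced) would need
dedicated cases on top of something like this.

/*
 * Hypothetical minimal example (not the testcase from [2]): register an
 * 8 MiB fixed buffer backed by 2 MiB hugepages. Requires reserved
 * hugepages, e.g. echo 8 > /proc/sys/vm/nr_hugepages.
 */
#define _GNU_SOURCE
#include <liburing.h>
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE	(8UL * 1024 * 1024)	/* 4 x 2 MiB hugepages */

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	void *buf;
	int ret;

	/* Anonymous hugepage-backed mapping; fails if no hugepages are reserved. */
	buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;

	/*
	 * With the series applied, this registration should end up with a
	 * handful of hugepage-sized bvecs instead of 2048 4 KiB entries.
	 */
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret) {
		fprintf(stderr, "register_buffers: %d\n", ret);
		return 1;
	}

	io_uring_unregister_buffers(&ring);
	io_uring_queue_exit(&ring);
	munmap(buf, BUF_SIZE);
	return 0;
}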