Re: [PATCH 5/5] libnvdimm: add DMA support for pmem blk-mq

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 3 Aug 2017 08:41:31 -0700



On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
>> > Dave Jiang <dave.jiang@xxxxxxxxx> writes:
>> >
>> >> Adding DMA support for pmem blk reads. This provides signficant CPU
>> >> reduction with large memory reads with good performance. DMAs are triggered
>> >> with test against bio_multiple_segment(), so the small I/Os (4k or less?)
>> >> are still performed by the CPU in order to reduce latency. By default
>> >> the pmem driver will be using blk-mq with DMA.
>> >>
>> >> Numbers below are measured against pmem simulated via DRAM using
>> >> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> >> platform.  Keep in mind the performance for actual persistent memory
>> >> will differ.
>> >> Fio 2.21 was used.
>> >>
>> >> 64k: 1 task queuedepth=1
>> >> CPU Read:  7631 MB/s  99.7% CPU    DMA Read: 2415 MB/s  54% CPU
>> >> CPU Write: 3552 MB/s  100% CPU     DMA Write 2173 MB/s  54% CPU
>> >>
>> >> 64k: 16 tasks queuedepth=16
>> >> CPU Read: 36800 MB/s  1593% CPU    DMA Read:  29100 MB/s  607% CPU
>> >> CPU Write 20900 MB/s  1589% CPU    DMA Write: 23400 MB/s  585% CPU
>> >>
>> >> 2M: 1 task queuedepth=1
>> >> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> >> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>> >>
>> >> 2M: 16 tasks queuedepth=16
>> >> CPU Read:  18100 MB/s 1588% CPU    DMA Read:  21300 MB/s 180.9% CPU
>> >> CPU Write: 14100 MB/s 1594% CPU    DMA Write: 20400 MB/s 446.9% CPU
>> >>
>> >> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
>> >> ---
>> >
>> > Hi Dave,
>> >
>> > The above table shows that there's a performance benefit for 2M
>> > transfers but a regression for 64k transfers, if we forget about the CPU
>> > utilization for a second.
>>
>> I don't think we can forget about cpu utilization, and I would expect
>> most users would value cpu over small bandwidth increases. Especially
>> when those numbers show efficiency like this
>>
>> +160% cpu +26% read bandwidth
>> +171% cpu -10% write bandwidth
>>
>> > Would it be beneficial to have heuristics on
>> > the transfer size that decide when to use dma and when not? You
>> > introduced this hunk:
>> >
>> > -   rc = pmem_handle_cmd(cmd);
>> > +   if (cmd->chan && bio_multiple_segments(req->bio))
>> > +      rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>> > +   else
>> > +       rc = pmem_handle_cmd(cmd);
>> >
>> > Which utilizes dma for bios with multiple segments and for single
>> > segment bios you use the old path, maybe the single/multi segment logic
>> > can be amended to have something like:
>> >
>> >     if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>> >        rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req));
>> >     else
>> >        rc = pmem_handle_cmd(cmd);
>> >
>> > Just something woth considering IMHO.
>>
>> The thing we want to avoid most is having the cpu stall doing nothing
>> waiting for memory access, so I think once the efficiency of adding
>> more cpu goes non-linear it's better to let the dma engine handle
>> that. The current heuristic seems to achieve that, but the worry is
>> does it over-select the cpu for cases where requests could be merged
>> into a bulkier i/o that is more suitable for dma.
>>
>> I'm not sure we want another tunable vs predictable / efficient
>> behavior by default.
>
> Sorry for my late reply, but lots of automated performance regression CI tools
> _will_ report errors because of the performance drop. Yes these may be false
> positives, but it'll cost hours to resolve them. If we have a tunable it's way
> easier to set it correctly for the use cases. And please keep in mind some
> people don't really case about CPU utilization but maxiumum throughput.

Returning more cpu to the application may also increase bandwidth
because we stop spinning on completion and spend more time issuing
I/O. I'm also worried that this threshold cross-over point is workload
and pmem media specific. The tunable should be there as an override,
but I think we need simple/sane defaults out of the gate. A "no
regressions whatsoever on any metric" policy effectively kills this
effort as far as I can see. We otherwise need quite a bit more
qualification of different workloads, these 3 fio runs is not enough.
--
To unsubscribe from this list: send the line "unsubscribe dmaengine" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html