On 08/03/2017 08:41 AM, Dan Williams wrote:
> On Thu, Aug 3, 2017 at 1:06 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
>> On Tue, Aug 01, 2017 at 10:43:30AM -0700, Dan Williams wrote:
>>> On Tue, Aug 1, 2017 at 12:34 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
>>>> Dave Jiang <dave.jiang@xxxxxxxxx> writes:
>>>>
>>>>> Adding DMA support for pmem blk reads. This provides a significant CPU
>>>>> reduction for large memory reads with good performance. DMAs are triggered
>>>>> based on a test against bio_multiple_segments(), so small I/Os (4k or less?)
>>>>> are still performed by the CPU in order to reduce latency. By default
>>>>> the pmem driver will be using blk-mq with DMA.
>>>>>
>>>>> Numbers below are measured against pmem simulated via DRAM using
>>>>> memmap=NN!SS. The DMA engine used is the ioatdma on an Intel Skylake Xeon
>>>>> platform. Keep in mind that the performance for actual persistent memory
>>>>> will differ.
>>>>> Fio 2.21 was used.
>>>>>
>>>>> 64k: 1 task queuedepth=1
>>>>> CPU Read: 7631 MB/s 99.7% CPU    DMA Read: 2415 MB/s 54% CPU
>>>>> CPU Write: 3552 MB/s 100% CPU    DMA Write: 2173 MB/s 54% CPU
>>>>>
>>>>> 64k: 16 tasks queuedepth=16
>>>>> CPU Read: 36800 MB/s 1593% CPU   DMA Read: 29100 MB/s 607% CPU
>>>>> CPU Write: 20900 MB/s 1589% CPU  DMA Write: 23400 MB/s 585% CPU
>>>>>
>>>>> 2M: 1 task queuedepth=1
>>>>> CPU Read: 6013 MB/s 99.3% CPU    DMA Read: 7986 MB/s 59.3% CPU
>>>>> CPU Write: 3579 MB/s 100% CPU    DMA Write: 5211 MB/s 58.3% CPU
>>>>>
>>>>> 2M: 16 tasks queuedepth=16
>>>>> CPU Read: 18100 MB/s 1588% CPU   DMA Read: 21300 MB/s 180.9% CPU
>>>>> CPU Write: 14100 MB/s 1594% CPU  DMA Write: 20400 MB/s 446.9% CPU
>>>>>
>>>>> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx>
>>>>> ---
>>>>
>>>> Hi Dave,
>>>>
>>>> The above table shows that there's a performance benefit for 2M
>>>> transfers but a regression for 64k transfers, if we forget about the CPU
>>>> utilization for a second.
>>>
>>> I don't think we can forget about cpu utilization, and I would expect
>>> most users would value cpu over small bandwidth increases. Especially
>>> when those numbers show efficiency like this:
>>>
>>> +160% cpu +26% read bandwidth
>>> +171% cpu -10% write bandwidth
>>>
>>>> Would it be beneficial to have heuristics on
>>>> the transfer size that decide when to use dma and when not? You
>>>> introduced this hunk:
>>>>
>>>> -       rc = pmem_handle_cmd(cmd);
>>>> +       if (cmd->chan && bio_multiple_segments(req->bio))
>>>> +               rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>> +       else
>>>> +               rc = pmem_handle_cmd(cmd);
>>>>
>>>> which utilizes dma for bios with multiple segments, while single-segment
>>>> bios use the old path. Maybe the single/multi segment logic
>>>> can be amended to have something like:
>>>>
>>>>         if (cmd->chan && bio_segments(req->bio) > PMEM_DMA_THRESH)
>>>>                 rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
>>>>         else
>>>>                 rc = pmem_handle_cmd(cmd);
>>>>
>>>> Just something worth considering IMHO.
>>>
>>> The thing we want to avoid most is having the cpu stall doing nothing
>>> waiting for memory access, so I think once the efficiency of adding
>>> more cpu goes non-linear it's better to let the dma engine handle
>>> that. The current heuristic seems to achieve that, but the worry is
>>> whether it over-selects the cpu for cases where requests could be merged
>>> into a bulkier i/o that is more suitable for dma.
>>>
>>> I'm not sure we want another tunable vs. predictable / efficient
>>> behavior by default.
>>
>> Sorry for my late reply, but lots of automated performance regression CI tools
>> _will_ report errors because of the performance drop. Yes, these may be false
>> positives, but it'll cost hours to resolve them. If we have a tunable, it's way
>> easier to set it correctly for the use cases. And please keep in mind that some
>> people don't really care about CPU utilization, but about maximum throughput.
>
> Returning more cpu to the application may also increase bandwidth
> because we stop spinning on completion and spend more time issuing
> I/O. I'm also worried that this threshold cross-over point is workload
> and pmem media specific. The tunable should be there as an override,
> but I think we need simple/sane defaults out of the gate. A "no
> regressions whatsoever on any metric" policy effectively kills this
> effort as far as I can see. We otherwise need quite a bit more
> qualification of different workloads; these 3 fio runs are not enough.
>

I can put in a tunable knob like Johannes suggested and leave the default
for that to what it is currently. And we can adjust as necessary as we do
more testing.
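[For reference, a minimal sketch of what such a tunable could look like if exposed
as a module parameter, keeping the current multi-segment behavior as the default.
The parameter name pmem_dma_thresh, the helper pmem_dispatch_cmd, and the
struct pmem_cmd type are assumptions for illustration only and are not taken
from the actual patch; only cmd->chan, pmem_handle_cmd(), and
pmem_handle_cmd_dma() come from the hunks quoted above.]

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/bio.h>

/*
 * Assumed tunable: minimum number of bio segments before the DMA path
 * is taken.  The default of 1 roughly preserves today's behavior
 * (multi-segment bios go to DMA); raising it keeps more small I/O on
 * the CPU path.
 */
static unsigned int pmem_dma_thresh = 1;
module_param(pmem_dma_thresh, uint, 0644);
MODULE_PARM_DESC(pmem_dma_thresh,
		 "Minimum bio segment count before DMA offload is used");

/* Hypothetical dispatch helper wrapping the hunk discussed in the thread. */
static int pmem_dispatch_cmd(struct pmem_cmd *cmd, struct request *req)
{
	int rc;

	if (cmd->chan && bio_segments(req->bio) > pmem_dma_thresh)
		rc = pmem_handle_cmd_dma(cmd, op_is_write(req_op(req)));
	else
		rc = pmem_handle_cmd(cmd);

	return rc;
}

[With a sysfs-visible mode (0644), the threshold could then be adjusted at
runtime, e.g. via /sys/module/<driver>/parameters/pmem_dma_thresh, so CI
setups that care only about throughput can steer more I/O to the CPU path
without a rebuild.]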