On Mon, Jan 17, 2022 at 10:27:01AM +0000, Mel Gorman wrote:
> > 1) You're right. When the options "noverify=1" and "polling=1" are used,
> > then no performance reduction occurs.
>
> How about just noverify=1 on its own? It's a stronger indicator that
> cache hotness is a factor.
>

With "noverify=1 polled=0" the performance reduction is only 10-20%, but
it still exists.

-----< v5.15.8-vanilla >-----
[17057.866760] dmatest: Added 1 threads using dma0chan0
[17060.133880] dmatest: Started 1 threads using dma0chan0
[17060.154343] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 49338.85 iops 3157686 KB/s (0)
[17063.737887] dmatest: Added 1 threads using dma0chan0
[17065.113838] dmatest: Started 1 threads using dma0chan0
[17065.137659] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42183.41 iops 2699738 KB/s (0)
[17100.339989] dmatest: Added 1 threads using dma0chan0
[17102.190764] dmatest: Started 1 threads using dma0chan0
[17102.214285] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 42844.89 iops 2742073 KB/s (0)
-----< end >-----

-----< 5.15.8-ioat-ptdma-dirty-fix+ >-----
[ 6183.356549] dmatest: Added 1 threads using dma0chan0
[ 6187.868237] dmatest: Started 1 threads using dma0chan0
[ 6187.887389] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52753.74 iops 3376239 KB/s (0)
[ 6201.913154] dmatest: Added 1 threads using dma0chan0
[ 6204.701340] dmatest: Started 1 threads using dma0chan0
[ 6204.720490] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 52614.96 iops 3367357 KB/s (0)
[ 6285.114603] dmatest: Added 1 threads using dma0chan0
[ 6287.031875] dmatest: Started 1 threads using dma0chan0
[ 6287.050278] dmatest: dma0chan0-copy0: summary 1000 tests, 0 failures 54939.01 iops 3516097 KB/s (0)
-----< end >-----

> > 2) The DMA Engine on certain devices, e.g. Switchtec DMA and AMD PTDMA,
> > is used in particular for off-CPU data transfer through the device's NTB
> > to a remote host. In the NTRDMA project, which I'm involved in, the DMA
> > Engine sends data to a remote ring buffer, and on data arrival the CPU
> > processes the local ring buffers.
>
> Is there any impact of the patch in this case? Given that it's a remote
> host, the data is likely cache cold anyway.
>

It's complicated. Currently we have a number of problems with the project,
so we are decomposing them and trying to solve them separately. This is
where we ran into the DMA Engine issue.

> > 4) Do you mean that with noverify=N and the dirty patch, data
> > verification is performed on cached data and thus the measured
> > performance is fake?
>
> I think it's the data verification going slower because the tasks are
> not aggressively migrating on interrupt. The flip side is other
> interrupts such as IO completion should not migrate the tasks given that
> the interrupt is not necessarily correlated with data hotness.
>

It's quite strange, because dmatest subtracts the verification time from
the overall test time. I suspect the measurement may be inaccurate.

> > 5) What should DMA Engine enabled drivers (and dmatest) use as a design
> > pattern to conform to the migration/cache behavior? Does the scheduler
> > optimisation conflict with DMA Engine performance in general?
>
> I'm not familiar with DMA engine drivers but if they use wake_up
> interfaces then passing WF_SYNC or calling the wake_up_*_sync helpers
> may force the migration.
>

Thanks for the advice. I'll try to check if this is a solution; a rough
sketch of how I read the suggestion is below the signature.

--
Regards,
 Alexander
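
P.S. A minimal sketch of how I understand the wake_up_*_sync suggestion,
assuming a dmaengine client that sleeps on a wait queue until its
completion callback fires. The my_xfer names are made up for illustration
and are not taken from any existing driver.

#include <linux/wait.h>
#include <linux/dmaengine.h>

struct my_xfer {			/* hypothetical client state */
	wait_queue_head_t wait;
	bool done;
};

/* dma_async_tx_callback: runs from the DMA completion (interrupt) path */
static void my_xfer_done(void *arg)
{
	struct my_xfer *xfer = arg;

	WRITE_ONCE(xfer->done, true);
	/*
	 * The _sync variant wakes the waiter with WF_SYNC, hinting the
	 * scheduler to run it on the CPU that handled the completion,
	 * where the transferred data may still be cache hot.
	 */
	wake_up_interruptible_sync(&xfer->wait);
}

/* submitter side: sleep until the callback above marks the transfer done */
static int my_xfer_wait(struct my_xfer *xfer)
{
	return wait_event_interruptible(xfer->wait, READ_ONCE(xfer->done));
}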