On Fri, Apr 14, 2017 at 12:10:27PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 10 Apr 2017 14:26:16 -0700
> Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Mon, 10 Apr 2017 16:08:21 +0100 Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > IRQ context was excluded from using the Per-Cpu-Pages (PCP) lists that
> > > cache order-0 pages in commit 374ad05ab64d ("mm, page_alloc: only use
> > > per-cpu allocator for irq-safe requests").
> > > 
> > > This unfortunately also excluded SoftIRQ, which hurt performance for
> > > the use-case of refilling DMA RX rings in softirq context.
> > 
> > Out of curiosity: by how much did it "hurt"?
> > 
> > <ruffles through the archives>
> > 
> > Tariq found:
> > 
> > : I disabled the page-cache (recycle) mechanism to stress the page
> > : allocator, and see a drastic degradation in BW, from 47.5 G in v4.10 to
> > : 31.4 G in v4.11-rc1 (34% drop).
> 
> I've tried to reproduce this in my home testlab, using ConnectX-4 dual
> 100Gbit/s. Hardware limits mean I cannot reach 100Gbit/s once a memory
> copy is performed. (Word of warning: you need PCIe Gen3 x16, which I do
> have, to handle 100Gbit/s, and the memory bandwidth of the system also
> needs something like 2x 12500MBytes/s, which is where my system failed.)
> 
> The mlx5 driver has a driver-local page recycler, which I can see fail
> between 29% and 38% of the time with 8 parallel netperf TCP_STREAMs. I
> speculate that adding more streams will make it fail more often. To
> factor out the driver recycler, I simply disabled it (as I believe Tariq
> also did).
> 
> With the mlx5 recycler disabled, 8 parallel netperf TCP_STREAMs:
> 
>  Baseline v4.10.0  : 60316 Mbit/s
>  Current 4.11.0-rc6: 47491 Mbit/s
>  This patch        : 60662 Mbit/s
> 
> While this patch does "fix" the performance regression, it does not
> bring any noticeable improvement over the v4.10 baseline (as my
> micro-bench also indicated), thus I feel our previous optimization is
> almost nullified. (p.s. It does feel wrong to argue against my own
> patch ;-))
> 
> The reason for the current 4.11.0-rc6 regression is lock contention on
> the (per-NUMA) page allocator lock: perf report shows we spend 34.92%
> in queued_spin_lock_slowpath (compared to the top#2 copy cost of 13.81%
> in copy_user_enhanced_fast_string). The lock contention is likely due
> to the per-cpu allocator being bypassed.
> 
> > then with this patch he found
> > 
> > : It looks very good! I get line-rate (94Gbits/sec) with 8 streams, in
> > : comparison to less than 55Gbits/sec before.
> > 
> > Can I take this to mean that the page allocator's per-cpu-pages feature
> > ended up doubling the performance of this driver? Better than the
> > driver's private page recycling? I'd like to believe that, but am
> > having trouble doing so ;)
> 
> I would not conclude that. I'm also very suspicious about such big
> performance "jumps". Tariq should also benchmark v4.10 with the
> mlx5 recycler disabled, as I believe the results should be the same as
> after this patch.
> 
> That said, it is possible to see a regression this large when all the
> CPUs are contending on the page allocator lock. AFAIK Tariq also
> mentioned seeing 60% spent on the lock, which would confirm this theory.
> 

On that basis, I've posted a revert of the original patch which should go
into either 4.11 or 4.11-stable. Andrew, the revert should also remove the
"re-enable softirq use of per-cpu page" patch from mmotm. Thanks.
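
For anyone who wants to poke at the gating decision itself, below is a tiny
standalone userspace sketch. It is not the real mm/page_alloc.c code; the
context flags and helper names are made up for illustration. It only models
how a check that treats any interrupt context as unsafe pushes the softirq
RX-refill path onto the zone lock, while a hypothetical hard-IRQ-only check
would let softirq keep using the per-CPU lists:

/*
 * Illustrative userspace sketch only -- NOT the kernel's page allocator.
 * It models the gating decision discussed above: treating any interrupt
 * context as unsafe also covers softirq (where the RX ring refill runs),
 * so those allocations fall back to the zone lock. The second check is a
 * hypothetical variant that excludes only hard IRQ context.
 */
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel's context tests; purely hypothetical flags. */
struct ctx {
	const char *name;
	bool hardirq;
	bool softirq;
};

/* Any interrupt context skips the per-CPU lists (roughly in_interrupt()). */
static bool use_pcp_irq_safe_only(const struct ctx *c)
{
	return !(c->hardirq || c->softirq);
}

/* Hypothetical softirq-friendly variant (roughly in_irq() only). */
static bool use_pcp_allow_softirq(const struct ctx *c)
{
	return !c->hardirq;
}

int main(void)
{
	const struct ctx contexts[] = {
		{ "process context",          false, false },
		{ "softirq (RX ring refill)", false, true  },
		{ "hard IRQ",                 true,  false },
	};

	for (size_t i = 0; i < sizeof(contexts) / sizeof(contexts[0]); i++) {
		const struct ctx *c = &contexts[i];
		printf("%-26s  irq-safe-only: %-9s  allow-softirq: %s\n",
		       c->name,
		       use_pcp_irq_safe_only(c) ? "PCP list" : "zone lock",
		       use_pcp_allow_softirq(c) ? "PCP list" : "zone lock");
	}
	return 0;
}

Compiling and running it just prints which path each context would take under
the two checks; the real allocator obviously has far more going on, but the
table makes it clear why the softirq refill path ends up on the zone lock.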
-- 
Mel Gorman
SUSE Labs