On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > ( The 4x JVM regression is still an open bug I think - I'll
> >   re-check and fix that one next, no need to re-report it,
> >   I'm on it. )
>
> So I tested this on !THP too and the combined numbers are now:
>
>                                        |
>   [ SPECjbb multi-4x8 ]                |
>   [ tx/sec ]              v3.7         |  numa/core-v16
>   [ higher is better ]    -----        |  -------------
>                                        |
>   +THP:                   639k         |  655k    +2.5%
>   -THP:                   510k         |  517k    +1.3%
>
> So it's not a regression anymore, regardless of whether THP is
> enabled or disabled.
>
> The current updated table of performance results is:
>
> -------------------------------------------------------------------------
>   [ seconds         ]     v3.7    AutoNUMA  |  numa/core-v16   [ vs. v3.7]
>   [ lower is better ]    -----    --------  |  -------------   -----------
>                                             |
>   numa01                 340.3     192.3    |      139.4          +144.1%
>   numa01_THREAD_ALLOC    425.1     135.1    |      121.1          +251.0%
>   numa02                  56.1      25.3    |       17.5          +220.5%
>                                             |
>   [ SPECjbb transactions/sec ]              |
>   [ higher is better         ]              |
>                                             |
>   SPECjbb 1x32 +THP       524k      507k    |       638k           +21.7%
>   SPECjbb 1x32 !THP       395k               |      512k           +29.6%
>                                             |
> -----------------------------------------------------------------------
>                                             |
>   [ SPECjbb multi-4x8 ]                     |
>   [ tx/sec ]              v3.7              |  numa/core-v16
>   [ higher is better ]    -----             |  -------------
>                                             |
>   +THP:                   639k              |  655k    +2.5%
>   -THP:                   510k              |  517k    +1.3%
>
> So I think I've addressed all regressions reported so far - if
> anyone can still see something odd, please let me know so I can
> reproduce and fix it ASAP.

I can confirm single JVM JBB is working well for me.  I see a 30%
improvement over autoNUMA.  What I can't make sense of is some perf
stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):

tip numa/core:

  5,429,632,865 node-loads
  3,806,419,082 node-load-misses       (70.1%)
  2,486,756,884 node-stores
  2,042,557,277 node-store-misses      (82.1%)
  2,878,655,372 node-prefetches
  2,201,441,900 node-prefetch-misses

autoNUMA:

  4,538,975,144 node-loads
  2,666,374,830 node-load-misses       (58.7%)
  2,148,950,354 node-stores
  1,682,942,931 node-store-misses      (78.3%)
  2,191,139,475 node-prefetches
  1,633,752,109 node-prefetch-misses

The percentage of misses is higher for numa/core.  I would have
expected the performance increase to be due to lower "node-misses",
but perhaps I am misinterpreting the perf data.

One other thing I noticed was that both tests are not even using all
of the CPU (75-80%), so I suspect there's a JVM scalability issue with
this workload at this number of cpu threads (80).  This is an IBM JVM,
so there may be some differences.  I am curious whether any of the
others testing JBB are getting 100% cpu utilization at their warehouse
peak.

So, while the performance results are encouraging, I would like to
correlate them with some kind of perf data that confirms why we think
it's better.

> Next I'll work on making multi-JVM more of an improvement, and
> I'll also address any incoming regression reports.

I have issues with multiple KVM VMs running either JBB or
dbench-in-tmpfs, and I suspect whatever I am seeing is similar to what
the multi-JVM bare-metal case hits.  What I typically see is no real
convergence on a single node for resource usage for any of the VMs.
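(For reference, per-node numbers like the ones in the example below can
be pulled from cpuacct.usage_percpu plus the N<node>=<pages> fields in
/proc/<pid>/numa_maps.  Here is a rough sketch of the idea in Python;
the cgroup path, the qemu pid, and the helper names are only
illustrative, not the exact script I run:)

    #!/usr/bin/env python
    # Sketch: per-node cpu time and memory placement for one VM.
    # CPUACCT_DIR and QEMU_PID are examples; adjust for your setup.
    import glob
    import re
    from collections import defaultdict

    CPUACCT_DIR = "/cgroup/cpuacct/libvirt/qemu/at-vm01"  # example cgroup
    QEMU_PID = 12345                                       # example qemu-kvm pid

    def cpu_to_node():
        """Map each cpu id to its NUMA node via /sys/.../node*/cpulist."""
        mapping = {}
        for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
            node = int(node_dir.rsplit("node", 1)[1])
            with open(node_dir + "/cpulist") as f:
                for part in f.read().strip().split(","):
                    lo, _, hi = part.partition("-")
                    for cpu in range(int(lo), int(hi or lo) + 1):
                        mapping[cpu] = node
        return mapping

    def cpu_ns_per_node(cgroup_dir):
        """Sum cpuacct.usage_percpu (nanoseconds) per NUMA node."""
        node_of = cpu_to_node()
        totals = defaultdict(int)
        with open(cgroup_dir + "/cpuacct.usage_percpu") as f:
            for cpu, ns in enumerate(int(x) for x in f.read().split()):
                totals[node_of.get(cpu, -1)] += ns
        return totals

    def pages_per_node(pid):
        """Total pages per node from N<node>=<pages> fields in numa_maps."""
        totals = defaultdict(int)
        with open("/proc/%d/numa_maps" % pid) as f:
            for line in f:
                for node, pages in re.findall(r"N(\d+)=(\d+)", line):
                    totals[int(node)] += int(pages)
        return totals

    def show(label, totals):
        grand = float(sum(totals.values())) or 1.0
        print(label)
        for node in sorted(totals):
            print("  node%02d  %15d |%03d%%"
                  % (node, totals[node], round(100 * totals[node] / grand)))

    show("host cpu usage (ns) per node:", cpu_ns_per_node(CPUACCT_DIR))
    show("VM memory placement in host (pages):", pages_per_node(QEMU_PID))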
For example, when running 8 VMs, 10 vCPUs each, a VM may have the
following resource usage:

host cpu usage from cpuacct cgroup:
/cgroup/cpuacct/libvirt/qemu/at-vm01

  node00              node01              node02               node03
  199056918180|005%   752455339099|020%   1811704146176|049%   888803723722|024%

And VM memory placement in host (in pages):

  node00         node01         node02         node03
  107566|023%    115245|025%    117807|025%    119414|025%

Conversely, autoNUMA usually has 98+% of cpu and memory in one of the
host nodes for each of these VMs.  AutoNUMA is about 30% better in
these tests.

That is data for the entire run time, and "not converged" could
possibly mean "converged but moved around", but I doubt that's what's
happening.

Here's perf data for the dbench VMs:

numa/core:

  468,634,508 node-loads
  210,598,643 node-load-misses       (44.9%)
  172,735,053 node-stores
  107,535,553 node-store-misses      (51.1%)
  208,064,103 node-prefetches
  160,858,933 node-prefetch-misses

autoNUMA:

  666,498,425 node-loads
  222,643,141 node-load-misses       (33.4%)
  219,003,566 node-stores
   99,243,370 node-store-misses      (45.3%)
  315,439,315 node-prefetches
  254,888,403 node-prefetch-misses

These seem to make a little more sense to me, but the percentages for
autoNUMA still seem a little high (though at least lower than
numa/core).  I need to take a manually pinned measurement to compare.

> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

I've been running on numa/core, but I'll switch to master and try
these again.

Thanks,

-Andrew Theurer

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@xxxxxxxxx