* Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:

> On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > * Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
> >
> > > The throughput of pure mmap with mutex vs pure mmap is below:
> > >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads     vanilla     all rwsem     without optspin
> > >                          patches
> > > 1              3.0%       -1.0%         -1.7%
> > > 5              7.2%      -26.8%          5.5%
> > > 10             5.2%      -10.6%         22.1%
> > > 20             6.8%       16.4%         12.5%
> > > 40            -0.2%       32.7%          0.0%
> > >
> > > So with mutex, the vanilla kernel and the one without optspin both
> > > run faster.  This is consistent with what Peter reported.  With
> > > optspin, the picture is more mixed, with lower throughput at low to
> > > moderate number of threads and higher throughput with high number
> > > of threads.
> >
> > So, going back to your original table:
> >
> > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > #threads     vanilla     all       without optspin
> > > 1              3.0%       -1.0%      -1.7%
> > > 5              7.2%      -26.8%       5.5%
> > > 10             5.2%      -10.6%      22.1%
> > > 20             6.8%       16.4%      12.5%
> > > 40            -0.2%       32.7%       0.0%
> > >
> > > In general, the vanilla and no-optspin cases perform better with
> > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex
> > > is worse at low to moderate contention and better at high
> > > contention.
> >
> > it appears that 'without optspin' is a pretty good choice - if it
> > wasn't for that '1 thread' number, which, if I assume correctly that
> > it is the uncontended case, is one of the most common usecases ...
> >
> > How can the single-threaded case get slower? None of the patches
> > should really cause noticeable overhead in the non-contended case.
> > That looks weird.
> >
> > It would also be nice to see the 2, 3, 4 thread numbers - those are
> > the most common contention scenarios in practice - where do we see
> > the first improvement in performance?
> >
> > Also, it would be nice to include a noise/stddev figure, it's really
> > hard to tell whether -1.7% is statistically significant.
>
> Ingo,
>
> I think that the optimistic spin changes to rwsem should enhance the
> performance of real workloads after all.
>
> In my previous tests, I was doing mmap followed immediately by munmap
> without doing anything to the memory.  No real workload will behave
> that way and it is not the scenario that we should optimize for.  A
> much better approximation of real usages will be doing mmap, then
> touching the memory being mmaped, followed by munmap.

That's why I asked for a working testcase to be posted ;-) Not just
pseudocode - send the real .c thing please.

> This changes the dynamics of the rwsem as we are now dominated by read
> acquisitions of mmap sem due to the page faults, instead of having
> only write acquisitions from mmap. [...]

Absolutely, the page fault read case is the #1 optimization target of
rwsems.

> [...] In this case, any delay in write acquisitions will be costly as
> we will be blocking a lot of readers.  This is where optimistic
> spinning on write acquisitions of mmap sem can provide a very
> significant boost to the throughput.
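To make the locking pattern just described concrete, here is a small
user-space model of it - illustrative only, not kernel code, with a
pthread_rwlock standing in for mmap sem and arbitrary thread counts and
run time.  The "fault" threads take the lock for read the way the page
fault path does, while the single "mmap" thread takes it for write, so
any delay in the write acquisition stalls every reader queued behind it.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem_model = PTHREAD_RWLOCK_INITIALIZER;
static volatile int stop;

static void *fault_thread(void *arg)            /* models the page fault path */
{
        unsigned long *faults = arg;

        while (!stop) {
                pthread_rwlock_rdlock(&mmap_sem_model);  /* "down_read()"  */
                (*faults)++;                             /* handle a fault */
                pthread_rwlock_unlock(&mmap_sem_model);  /* "up_read()"    */
        }
        return NULL;
}

static void *mmap_thread(void *arg)             /* models mmap()/munmap() */
{
        unsigned long *maps = arg;

        while (!stop) {
                pthread_rwlock_wrlock(&mmap_sem_model);  /* "down_write()" */
                (*maps)++;                               /* update VMAs    */
                pthread_rwlock_unlock(&mmap_sem_model);  /* "up_write()"   */
        }
        return NULL;
}

int main(void)
{
        pthread_t readers[4], writer;
        unsigned long faults[4] = { 0 }, maps = 0;
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&readers[i], NULL, fault_thread, &faults[i]);
        pthread_create(&writer, NULL, mmap_thread, &maps);

        sleep(5);
        stop = 1;

        pthread_join(writer, NULL);
        for (i = 0; i < 4; i++) {
                pthread_join(readers[i], NULL);
                printf("faults[%d] = %lu\n", i, faults[i]);
        }
        printf("maps = %lu\n", maps);
        return 0;
}

(Build with gcc -pthread; the down_read()/down_write() comments only mark
which mmap sem operation each call is standing in for.)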
>
> I change the test case to the following with writes to the mmaped
> memory:
>
> #define MEMSIZE (1 * 1024 * 1024)
>
> char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
>
> void testcase(unsigned long long *iterations)
> {
>         int i;
>
>         while (1) {
>                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
>                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>                 assert(c != MAP_FAILED);
>                 for (i=0; i<MEMSIZE; i+=8) {
>                         c[i] = 0xa;
>                 }
>                 munmap(c, MEMSIZE);
>
>                 (*iterations)++;
>         }
> }

It would be _really_ nice to stick this into tools/perf/bench/ as:

        perf bench mem pagefaults

or so, with a number of parallelism and workload patterns. See
tools/perf/bench/numa.c for a couple of workload generators - although
those are not page fault intense.

So that future generations can run all these tests too and such.

> I compare the throughput where I have the complete rwsem patchset
> against vanilla and the case where I take out the optimistic spin
> patch.  I have increased the run time by 10x from my previous
> experiments and do 10 runs for each case.  The standard deviation is
> ~1.5%, so any change above 1.5% is statistically significant.
>
> % change in throughput vs the vanilla kernel.
> Threads       all             No-optspin
> 1              +0.4%           -0.1%
> 2              +2.0%           +0.2%
> 3              +1.1%           +1.5%
> 4              -0.5%           -1.4%
> 5              -0.1%           -0.1%
> 10             +2.2%           -1.2%
> 20           +237.3%           -2.3%
> 40           +548.1%           +0.3%

The tail is impressive. The early parts are important as well, but it's
really hard to tell the significance of the early portion without having
a stddev column.

( "perf stat --repeat N" will give you stddev output, in handy
  percentage form. )

> Now when I test the case where we acquire the mutex in user space
> before mmap, I get the following data versus the vanilla kernel.
> There's little contention on mmap sem acquisition in this case.
>
> n             all             No-optspin
> 1              +0.8%           -1.2%
> 2              +1.0%           -0.5%
> 3              +1.8%           +0.2%
> 4              +1.5%           -0.4%
> 5              +1.1%           +0.4%
> 10             +1.5%           -0.3%
> 20             +1.4%           -0.2%
> 40             +1.3%           +0.4%
>
> Thanks.

A bit hard to see, as there's no comparison _between_ the pthread_mutex
and plain-parallel versions. No contention isn't a great result if
performance suffers because it's all serialized.

Thanks,

        Ingo
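For completeness, a minimal standalone sketch of the mutex-serialized
variant Tim describes above - an assumed reconstruction, not his actual
test code: the thread-count argument, the 30-second measurement window,
and the choice to hold the lock across the touch loop as well as the
mmap/munmap calls are all guesses.

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (1 * 1024 * 1024)

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int stop;

static void *worker(void *arg)
{
        unsigned long long *iterations = arg;
        int i;

        while (!stop) {
                /* Serialize the whole mmap + touch + munmap sequence in user
                 * space, so mmap sem itself sees little concurrent pressure. */
                pthread_mutex_lock(&lock);
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                for (i = 0; i < MEMSIZE; i += 8)
                        c[i] = 0xa;             /* fault in every page */
                munmap(c, MEMSIZE);
                pthread_mutex_unlock(&lock);

                (*iterations)++;
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
        unsigned long long *counts, total = 0;
        pthread_t *tids;

        if (nthreads < 1)
                nthreads = 1;
        tids = calloc(nthreads, sizeof(*tids));
        counts = calloc(nthreads, sizeof(*counts));

        for (i = 0; i < nthreads; i++)
                pthread_create(&tids[i], NULL, worker, &counts[i]);

        sleep(30);                      /* arbitrary measurement interval */
        stop = 1;

        for (i = 0; i < nthreads; i++) {
                pthread_join(tids[i], NULL);
                total += counts[i];
        }
        printf("%d threads: %llu iterations in ~30s\n", nthreads, total);
        return 0;
}

Built as something like gcc -O2 -pthread -o mmap-mutex mmap-mutex.c (the
file name is made up), running it under "perf stat --repeat 10 --
./mmap-mutex 4" would also produce the stddev-annotated output referred
to above.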