I spent yesterday following up on one of the items from the Berlin summit: memory usage and allocation-related performance. Besides being a general concern, this is a very specific worry with multiplexing, because some of our code in this area scales very poorly to the hundreds of threads per process that multiplexing can create. For example, one fairly straightforward performance test (detailed later) ran so slowly at first that I stopped it after it had run for three times as long as without multiplexing, and AFAICT it hadn't even reached its half-way point. Even with some fairly significant tweaking (mostly turning spinlocks into mutexes and raising various thread counts), it still ran 30% slower than normal. Hence, this investigation.

** BASELINE TESTING **

The test I used for most of this is fairly simple. First, I create 20 volumes, each 2x2, so 80 bricks total. Then, a first pass of the test creates 1000 empty files on each volume. Lastly, a second pass writes (synchronously) and then unlinks each file. The test is deliberately constructed so that it doesn't use a ton of disk space, which allows it to run on ramdisks and isolate CPU/memory performance issues instead of always waiting for disk. Because each ramdisk is served by only a single kernel thread[1], using many of them is also important.

On my test machine, using ramdisks, this test usually runs in about five minutes. That's a good length - long enough to provide meaningful results but short enough to allow rapid iteration on tuning or new code. In fact, with current code from master and no tweaking at all, it takes *exactly* five minutes. By converting GF_LOCK from a spinlock to a mutex I can get a modest gain, down to 4:41. As I said before, this test effectively couldn't even run with multiplexing unless the GF_LOCK change was present, so for the remainder of this document just assume it's there. My testing was also done with a version of io-threads that shares its thread pool across all instances (i.e. bricks) instead of allocating a separate pool for each. This was a result of previous multiplexing-performance work, and was again necessary for tests to run in reasonable time.

With multiplexing and no other tuning, the time is 6:46. This is where I started profiling, and I found the following chain:

- Most time was still spent in locking.
- Most of that was from memory-pool code.
- Most of that was from dictionary code, though a couple of other culprits (call stubs and stacks/frames) were also noticeable.

[1] Fun fact: there used to be a multi-threaded version of the loopback driver, before 2.6 IIRC. Then, for reasons that are unclear and probably very stupid, a Red Hat engineer (someone whose name many of you would know) reverted it to being single-threaded. This particular piece of kernel lameness was a significant pain point at my last job, but I've digressed too long already to explain why.

** EASY EXPERIMENTS **

With the above observations in mind, I tried two experiments.

First, I tried disabling memory pools entirely - i.e. just calling through to malloc/free instead. This got the time down to 5:59.

As an alternative, I changed some of the major memory-pool users mentioned above to call malloc/free directly (leaving the pools in place for other users), which eventually improved things a little bit more:

- Modified dictionary and call-stub code: 6:14
- Modified stack/frame code as well: 5:51

The first result is faster than where we started, but still slower than disabling memory pools entirely. The second result is faster still.
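To make the "call malloc/free directly" variant concrete, here is a rough sketch of the kind of change involved. This is illustrative only, not the actual patch: obj_t and obj_pool are placeholder names, and the calls under the #ifdef stand in for the existing pooled allocation path.

    /*
     * Illustrative sketch: route one hot object type straight to the system
     * allocator at compile time, leaving the memory pool in place for
     * everything else.  obj_t and obj_pool are placeholders, not actual
     * GlusterFS identifiers.
     */
    #include <stdlib.h>

    typedef struct obj {
        int refcount;
        /* ... real fields ... */
    } obj_t;

    /* #define OBJ_USE_MEM_POOL 1 */

    static obj_t *
    obj_alloc (void *obj_pool)
    {
    #ifdef OBJ_USE_MEM_POOL
        return mem_get0 (obj_pool);         /* pooled path */
    #else
        (void) obj_pool;
        return calloc (1, sizeof (obj_t));  /* direct path measured above */
    #endif
    }

    static void
    obj_release (obj_t *obj)
    {
    #ifdef OBJ_USE_MEM_POOL
        mem_put (obj);
    #else
        free (obj);
    #endif
    }

The same kind of bypass, applied to the dictionary, call-stub, and stack/frame allocations, is what the numbers above measure.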
My conclusion was that memory pools are still hurting more than they help, but the benefit from avoiding GF_*ALLOC/GF_FREE as well in these three cases outweighed the ongoing cost of leaving memory pools in place for everything else. Put another way, some hypothetical memory-pool approach might help by reducing GF_*ALLOC/GF_FREE overhead, but the implementation we have is insufficient. I also tried an experiment with *both* these changes and memory pools fully disabled. This yielded a time of 5:48, further supporting my theory.

At this point, since the result relies more heavily on malloc/free performance than before, it seemed like a good time to try jemalloc. By accident, I did this with a version that only included the dictionary and call-stub changes but not the stack/frame change. This resulted in only a very modest improvement - 6:11 vs. 6:14 before. Glad I tried, but so much for that experiment.

** NEW MEMORY POOL DESIGN **

With all of this preliminary work out of the way, it finally seemed worth trying to improve the memory-pool code itself. For this, I relied on a design I've used successfully elsewhere (a rough sketch appears at the end of this message):

- Within each pool (e.g. dictionary objects), each thread gets its own private sub-pool from which it can draw.
- Each per-thread pool is divided into hot and cold lists. Every allocation first attempts to use the hot list, then the cold list. When objects are freed, they always go on the hot list.
- Each global pool has a "pool sweeper" thread, which wakes up every N seconds (by default N=10). For each per-thread pool, it deletes everything on the cold list and moves the hot list to the cold list (so the hot list is now empty).

This design has several good properties:

- Very low lock contention. Per-thread pool locks have to exist for correctness, but they are never contended between normal threads. The only contention is with the pool sweeper, and that's still *very* rare.
- On a busy system, visits to the system memory allocator will tend toward zero as objects are all found on the per-thread hot list.
- Even on a system with cyclic behavior, where the cycle is less than N seconds, system-memory-allocator visits will still tend toward zero as items are found on the per-thread cold list.
- Memory usage at idle tends toward zero after 2N seconds, as every object gets moved to the cold list on one pool-sweeper iteration and then deleted on the next.
- With a "last in, first out" implementation, locality of reference and cache warmth (assuming a decent thread scheduler) are very good.

** CONCLUSION **

With an implementation of the above design, the time was down to 5:34. The fact that this is better than both the memory-pool-disabled time of 5:59 and the modify-common-callers result of 5:51 indicates that we've crossed back to the point where memory pools provide an overall benefit, by reducing calls to either GF_*ALLOC or plain malloc. I haven't tried this without multiplexing yet, but since the differences there were small anyway I expect it would provide a modest improvement there as well.
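As promised, here is a minimal sketch of the per-thread hot/cold pool described above, written from the description rather than taken from an actual patch. The type names, the fixed-size thread table, and the explicit per-thread index are simplifications (real code would find its per-thread sub-pool via thread-local storage), and lock/thread initialization is omitted.

    /*
     * Sketch only, not the real implementation.  One of these pools would
     * exist per object type (dictionaries, call stubs, frames, ...), and
     * obj_size is assumed to be at least sizeof (pool_obj_t).
     */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SWEEP_SECS  10      /* N in the description above */
    #define MAX_THREADS 1024

    typedef struct pool_obj {
        struct pool_obj *next;  /* intrusive free-list link */
    } pool_obj_t;

    typedef struct per_thread_pool {
        pthread_mutex_t  lock;  /* contended only by the sweeper, rarely */
        pool_obj_t      *hot;   /* freed since the last sweep */
        pool_obj_t      *cold;  /* survived one sweep; next sweep frees it */
    } per_thread_pool_t;

    typedef struct obj_pool {
        size_t            obj_size;
        per_thread_pool_t threads[MAX_THREADS];
    } obj_pool_t;

    /* Allocation: hot list first, then cold list, then the system allocator. */
    static void *
    pool_get (obj_pool_t *pool, unsigned int tid)
    {
        per_thread_pool_t *pt = &pool->threads[tid];
        pool_obj_t        *obj;

        pthread_mutex_lock (&pt->lock);
        if ((obj = pt->hot))
            pt->hot = obj->next;
        else if ((obj = pt->cold))
            pt->cold = obj->next;
        pthread_mutex_unlock (&pt->lock);

        return obj ? obj : malloc (pool->obj_size);
    }

    /* Frees always go on the hot list (LIFO, so recently used memory is
     * reused first and stays cache-warm). */
    static void
    pool_put (obj_pool_t *pool, unsigned int tid, void *ptr)
    {
        per_thread_pool_t *pt  = &pool->threads[tid];
        pool_obj_t        *obj = ptr;

        pthread_mutex_lock (&pt->lock);
        obj->next = pt->hot;
        pt->hot   = obj;
        pthread_mutex_unlock (&pt->lock);
    }

    /* Pool sweeper: every N seconds, free the cold list and demote hot to
     * cold, so idle memory drains back to the system within 2N seconds. */
    static void *
    pool_sweeper (void *arg)
    {
        obj_pool_t *pool = arg;

        for (;;) {
            sleep (SWEEP_SECS);
            for (int i = 0; i < MAX_THREADS; i++) {
                per_thread_pool_t *pt = &pool->threads[i];
                pool_obj_t        *dead;

                pthread_mutex_lock (&pt->lock);
                dead     = pt->cold;
                pt->cold = pt->hot;
                pt->hot  = NULL;
                pthread_mutex_unlock (&pt->lock);

                while (dead) {
                    pool_obj_t *next = dead->next;
                    free (dead);
                    dead = next;
                }
            }
        }
        return NULL;
    }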