Jeff,
If I understood brick-multiplexing correctly, add-brick/remove-brick add/remove graphs, right? I don't think the graph cleanup is in good shape, so it could lead to memory leaks etc. Did you get a chance to think about it?

On Mon, Sep 19, 2016 at 6:56 PM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
I have brick multiplexing[1] functional to the point that it passes all basic AFR, EC, and quota tests. There are still some issues with tiering, and I wouldn't consider snapshots functional at all, but it seemed like a good point to see how well it works. I ran some *very simple* tests with 20 volumes, each 2x distribute on top of 2x replicate.
First, the good news: it worked! Getting 80 bricks to come up in the same process, and then run I/O correctly across all of those, is pretty cool. Also, memory consumption is *way* down. RSS size went from 1.1GB before (total across 80 processes) to about 400MB (one process) with multiplexing. Each process seems to consume approximately 8MB globally plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408. Just considering the amount of memory, this means we could support about three times as many bricks as before. When memory *contention* is considered, the difference is likely to be even greater.
Bad news: some of our code doesn't scale very well in terms of CPU use. To test performance I ran a test which would create 20,000 files across all 20 volumes, then write and delete them, all using 100 client threads. This is similar to what smallfile does, but deliberately constructed to use a minimum of disk space - at any given moment, only one file per thread (at most) actually has 4KB worth of data in it. This allows me to run it against SSDs or even ramdisks even with high brick counts, to factor out slow disks in a study of CPU/memory issues. Here are some results and observations.
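(For illustration only: this is not the actual test harness, just a minimal C sketch of the shape of that workload. The mount point is hypothetical, and for brevity it uses one mount and interleaves the create/write/delete phases, whereas the real run spread the files across 20 separate volume mounts.)

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NUM_THREADS   100        /* client threads, as in the test above */
    #define FILES_PER_THR 200        /* 100 * 200 = 20,000 files total */
    #define FILE_SIZE     4096

    static const char *mount = "/mnt/testvol";   /* hypothetical mount point */

    static void *worker(void *arg)
    {
        long id = (long) arg;
        char path[256];
        char buf[FILE_SIZE];
        int  i, fd;

        memset(buf, 'x', sizeof(buf));

        for (i = 0; i < FILES_PER_THR; i++) {
            snprintf(path, sizeof(path), "%s/thr%ld.f%d", mount, id, i);
            fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0)
                continue;
            write(fd, buf, sizeof(buf));   /* at most one 4KB file per thread */
            close(fd);
            unlink(path);                  /* delete before moving on */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_THREADS];
        long      i;

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&tids[i], NULL, worker, (void *) i);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }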
* On my first run, the multiplexed version of the test took 77% longer to run than the non-multiplexed version (5:42 vs. 3:13). And that was after I'd done some hacking to use 16 epoll threads. There's something a bit broken about setting that option the normal way: the value you set doesn't actually make it to the place that tries to spawn the threads. Bumping this up further to 32 threads didn't seem to help.
* A little profiling showed me that we're spending almost all of our time in pthread_spin_lock. I disabled the code that uses spinlocks in place of regular mutexes, which immediately improved performance and also reduced CPU time by almost 50%.
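(To make the spinlock-vs-mutex point concrete: the lock wrappers can be built on either primitive, roughly like the simplified stand-in below - this is not the real locking header, just the shape of the switch. With ~100 threads contending, a spinlock burns a core while it waits; a mutex puts the waiter to sleep.)

    #include <pthread.h>

    /* Simplified stand-in for the lock wrappers -- not the real headers. */
    #ifdef USE_SPINLOCKS
    typedef pthread_spinlock_t gf_lock_t;
    #define LOCK_INIT(l)  pthread_spin_init(l, PTHREAD_PROCESS_PRIVATE)
    #define LOCK(l)       pthread_spin_lock(l)     /* busy-waits while contended */
    #define UNLOCK(l)     pthread_spin_unlock(l)
    #else
    typedef pthread_mutex_t gf_lock_t;
    #define LOCK_INIT(l)  pthread_mutex_init(l, NULL)
    #define LOCK(l)       pthread_mutex_lock(l)    /* sleeps while contended */
    #define UNLOCK(l)     pthread_mutex_unlock(l)
    #endif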
* The next round of profiling showed that a lot of the locking is in mem-pool code, and a lot of that in turn is from dictionary code. Changing the dict code to use malloc/free instead of mem_get/mem_put gave another noticeable boost.
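(A rough sketch of the kind of change that means, using a made-up pair structure rather than the real dict code: the pooled path funnels every allocation through the shared pool's lock, while plain calloc/free lets the allocator's own per-thread machinery absorb the contention.)

    #include <stdlib.h>
    #include <string.h>

    /* Made-up stand-in for a dict entry -- not the real data_pair_t. */
    typedef struct pair {
        struct pair *next;
        char        *key;
        char        *value;
    } pair_t;

    #ifdef USE_MEM_POOL
    /* pooled path: mem_get0/mem_put (from libglusterfs's mem-pool.h)
     * take the shared pool's lock on every call */
    extern struct mem_pool *pair_pool;
    #define PAIR_ALLOC()  mem_get0(pair_pool)
    #define PAIR_FREE(p)  mem_put(p)
    #else
    /* plain path: contention is handled inside the allocator */
    #define PAIR_ALLOC()  calloc(1, sizeof(pair_t))
    #define PAIR_FREE(p)  free(p)
    #endif

    pair_t *pair_new(const char *key, const char *value)
    {
        pair_t *p = PAIR_ALLOC();

        if (!p)
            return NULL;
        p->key   = strdup(key);
        p->value = strdup(value);
        return p;
    }

    void pair_destroy(pair_t *p)
    {
        if (!p)
            return;
        free(p->key);
        free(p->value);
        PAIR_FREE(p);
    }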
At this point run time was down to 4:50, which is 20% better than where I started but still far short of non-multiplexed performance. I can drive that down still further by converting more things to use malloc/free. There seems to be a significant opportunity here to improve performance - even without multiplexing - by taking a more careful look at our memory-management strategies:
* Tune the mem-pool implementation to scale better with hundreds of threads (a rough sketch of one possible approach follows this list).
* Use mem-pools more selectively, or even abandon them altogether.
* Try a different memory allocator such as jemalloc.
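As a strawman for the first item - purely illustrative, not code from any existing patch - one shape such a change could take is to give each thread a small private free list so the common case never touches a shared lock, falling back to the shared pool (or plain calloc, as below) only when the cache is empty or full:

    #include <pthread.h>
    #include <stdlib.h>

    struct cached_obj {
        struct cached_obj *next;
    };

    static __thread struct cached_obj *local_cache;   /* per-thread free list */
    static __thread unsigned int       local_count;
    #define LOCAL_CACHE_MAX 64
    #define OBJ_SIZE        256                       /* example object size */

    void *pool_get(void)
    {
        if (local_cache) {                        /* fast path: no lock at all */
            struct cached_obj *obj = local_cache;
            local_cache = obj->next;
            local_count--;
            return obj;
        }
        return calloc(1, OBJ_SIZE);               /* slow path: shared allocator */
    }

    void pool_put(void *ptr)
    {
        struct cached_obj *obj = ptr;

        if (local_count < LOCAL_CACHE_MAX) {      /* keep it for this thread */
            obj->next   = local_cache;
            local_cache = obj;
            local_count++;
            return;
        }
        free(ptr);                                /* cache full: give it back */
    }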
I'd certainly appreciate some help/collaboration in studying these options further. It's a great opportunity to make a large impact on overall performance without a lot of code or specialized knowledge. Even so, however, I don't think memory management is our only internal scalability problem. There must be something else limiting parallelism, and quite severely at that. My first guess is io-threads, so I'll be looking into that first. There's no *good* reason why running many bricks in one process should be slower than running them in separate processes. If it remains slower, then the limit on the number of bricks and volumes we can support will remain unreasonably low. Also, the problems I'm seeing here probably don't *only* affect multiplexing. Excessive memory/CPU use and poor parallelism are issues that we kind of need to address anyway, so if anybody has any ideas please let me know.
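(To illustrate the kind of pattern I mean by "something else limiting parallelism" - and not claiming this is what io-threads actually does - any place where every request in the process funnels through one lock-protected queue will cap parallelism no matter how many threads we add:)

    #include <pthread.h>
    #include <stdlib.h>

    /* Generic illustration of a serialization point -- not actual xlator code. */
    struct request {
        struct request *next;
        void          (*handler)(struct request *);
    };

    struct work_queue {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        struct request *head;
    };

    void enqueue(struct work_queue *q, struct request *req)
    {
        pthread_mutex_lock(&q->lock);      /* every producer serializes here */
        req->next = q->head;
        q->head   = req;
        pthread_cond_signal(&q->cond);
        pthread_mutex_unlock(&q->lock);
    }

    struct request *dequeue(struct work_queue *q)
    {
        struct request *req;

        pthread_mutex_lock(&q->lock);      /* ...and every worker does too */
        while (!q->head)
            pthread_cond_wait(&q->cond, &q->lock);
        req     = q->head;
        q->head = req->next;
        pthread_mutex_unlock(&q->lock);
        return req;
    }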
[1] http://review.gluster.org/#/c/14763/
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
--
Pranith