----- Original Message -----
> From: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
> To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Monday, September 19, 2016 6:56:39 PM
> Subject: Multiplexing - good news, bad news, and a plea for help
>
> I have brick multiplexing[1] functional to the point that it passes all
> basic AFR, EC, and quota tests. There are still some issues with tiering,
> and I wouldn't consider snapshots functional at all, but it seemed like a
> good point to see how well it works. I ran some *very simple* tests with
> 20 volumes, each 2x distribute on top of 2x replicate.
>
> First, the good news: it worked! Getting 80 bricks to come up in the same
> process, and then running I/O correctly across all of them, is pretty
> cool. Also, memory consumption is *way* down. RSS went from 1.1GB before
> (total across 80 processes) to about 400MB (one process) with
> multiplexing. Each process seems to consume approximately 8MB globally
> plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408. Just
> considering the amount of memory, this means we could support about three
> times as many bricks as before. When memory *contention* is considered,
> the difference is likely to be even greater.
>
> Bad news: some of our code doesn't scale very well in terms of CPU use.
> To test performance I ran a test which creates 20,000 files across all 20
> volumes, then writes and deletes them, all using 100 client threads. This
> is similar to what smallfile does, but deliberately constructed to use a
> minimum of disk space - at any given time, at most one file per thread
> actually has 4KB worth of data in it. This allows me to run it against
> SSDs or even ramdisks, even with high brick counts, to factor out slow
> disks in a study of CPU/memory issues. Here are some results and
> observations.
>
> * On my first run, the multiplexed version of the test took 77% longer to
> run than the non-multiplexed version (5:42 vs. 3:13), and that was after
> I'd done some hacking to use 16 epoll threads. There's something a bit
> broken about setting that option normally: the value you set doesn't
> actually make it to the place that spawns the threads. Bumping this up
> further to 32 threads didn't seem to help.
>
> * A little profiling showed me that we're spending almost all of our time
> in pthread_spin_lock. I disabled the code that uses spinlocks instead of
> regular mutexes, which immediately improved performance and also reduced
> CPU time by almost 50%.

I have a feeling that a significant chunk of this locking comes from
accessing inode/fd contexts (the others being dictionaries and mem-pools,
as you pointed out). Most of this code uses the LOCK/UNLOCK macros, which
prefer spin_lock over mutex_lock (see the definition of gf_lock_setup,
which sets use_spinlocks to true if the number of active processors is
greater than 1). Also, some of the code (like posix_readv) acquires global
locks (like fd->inode->lock) when it doesn't need to. For example, there
is no reason two reads - especially reads of independent regions - should
contend, yet posix_readv/posix_writev take that lock to get the fd-ctx,
which serializes them. Maybe we should bring in read/write locks or RCU?

> * The next round of profiling showed that a lot of the locking is in
> mem-pool code, and a lot of that in turn is from dictionary code.
> Changing the dict code to use malloc/free instead of mem_get/mem_put gave
> another noticeable boost.
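
To make the mem-pool contention concrete, below is a rough, self-contained
sketch of the before/after shape of that change. This is not the actual
mem-pool or dict code - toy_pool, toy_mem_get, toy_mem_put and
toy_dict_alloc are made-up names - but it shows why funnelling every small
dict allocation through one pool lock hurts with ~100 threads, and why
plain calloc/free (which most allocators back with per-thread arenas)
relieves that pressure:

/*
 * Rough, self-contained sketch; not the actual mem-pool or dict code.
 * toy_pool/toy_mem_get/toy_mem_put stand in for a mem-pool: every
 * allocation and release goes through one mutex, so with ~100 threads
 * doing dict operations the pool lock itself becomes the hot spot.
 * toy_dict_alloc/toy_dict_free show the "after" shape: plain
 * calloc/free, leaving contention to the allocator's own arenas.
 */

#include <pthread.h>
#include <stdlib.h>

struct toy_pool {
        pthread_mutex_t  lock;
        void            *free_list;    /* chain of recycled objects */
        size_t           obj_size;     /* assumed >= sizeof(void *) */
};

/* Before: pooled allocation, serialized on pool->lock. */
void *
toy_mem_get(struct toy_pool *pool)
{
        void *obj;

        pthread_mutex_lock(&pool->lock);        /* contended under load */
        obj = pool->free_list;
        if (obj)
                pool->free_list = *(void **)obj;
        pthread_mutex_unlock(&pool->lock);

        if (!obj)
                obj = calloc(1, pool->obj_size);
        return obj;
}

void
toy_mem_put(struct toy_pool *pool, void *obj)
{
        pthread_mutex_lock(&pool->lock);        /* contended again on free */
        *(void **)obj = pool->free_list;
        pool->free_list = obj;
        pthread_mutex_unlock(&pool->lock);
}

/* After: let the general-purpose allocator handle it. */
void *
toy_dict_alloc(size_t size)
{
        return calloc(1, size);
}

void
toy_dict_free(void *obj)
{
        free(obj);
}

Of course, the pool exists to cut allocator traffic in the first place, so
dropping it everywhere is a trade-off; per-thread (or per-CPU) free lists
inside the pool would be another way to get the same effect while keeping
the pooling.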
> At this point run time was down to 4:50, which is 20% better than where I
> started but still far short of non-multiplexed performance. I can drive
> that down still further by converting more things to use malloc/free.
> There seems to be a significant opportunity here to improve performance -
> even without multiplexing - by taking a more careful look at our
> memory-management strategies:
>
> * Tune the mem-pool implementation to scale better with hundreds of
> threads.
>
> * Use mem-pools more selectively, or even abandon them altogether.
>
> * Try a different memory allocator such as jemalloc.
>
> I'd certainly appreciate some help/collaboration in studying these
> options further. It's a great opportunity to make a large impact on
> overall performance without a lot of code or specialized knowledge. Even
> so, I don't think memory management is our only internal scalability
> problem. There must be something else limiting parallelism, and quite
> severely at that. My first guess is io-threads, so I'll be looking into
> that first, but if anybody else has any ideas please let me know. There's
> no *good* reason why running many bricks in one process should be slower
> than running them in separate processes. If it remains slower, then the
> limit on the number of bricks and volumes we can support will remain
> unreasonably low. Also, the problems I'm seeing here probably don't
> *only* affect multiplexing. Excessive memory/CPU use and poor parallelism
> are issues we need to address anyway, so if anybody has any ideas please
> let me know.

I am interested in this and would like to improve concurrency where
possible.

> [1] http://review.gluster.org/#/c/14763/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
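
For illustration, here is a rough sketch of the read/write-lock idea
mentioned above. It is not actual GlusterFS code - file_ctx_t and the
function names are made up - but it shows the shape: context lookups (the
common, read-mostly case in paths like posix_readv/posix_writev) take the
lock shared and no longer exclude each other, while only setting or
replacing the context takes it exclusively.

/*
 * Rough sketch only; not GlusterFS code.  file_ctx_t and these function
 * names are made up.  The idea: protect a read-mostly per-fd context
 * with a pthread rwlock so that concurrent readers (e.g. parallel reads
 * on the same fd) don't serialize on a single mutex/spinlock the way
 * they do on fd->inode->lock today.
 */

#include <pthread.h>
#include <stddef.h>

typedef struct {
        pthread_rwlock_t  lock;
        void             *ctx;          /* stands in for the stored fd-ctx */
} file_ctx_t;

int
file_ctx_init(file_ctx_t *f)
{
        f->ctx = NULL;
        return pthread_rwlock_init(&f->lock, NULL);
}

/* Read side: many lookups may proceed in parallel. */
void *
file_ctx_get(file_ctx_t *f)
{
        void *ctx;

        pthread_rwlock_rdlock(&f->lock);
        ctx = f->ctx;
        pthread_rwlock_unlock(&f->lock);

        return ctx;
}

/* Write side: only creating/replacing the context excludes readers. */
void
file_ctx_set(file_ctx_t *f, void *ctx)
{
        pthread_rwlock_wrlock(&f->lock);
        f->ctx = ctx;
        pthread_rwlock_unlock(&f->lock);
}

Whether rwlocks actually win depends on how short the critical sections
are; for very short ones, RCU or atomics might be needed to beat the
existing spinlocks.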