----- Original Message -----
> From: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
> To: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Monday, September 19, 2016 6:56:39 PM
> Subject: Multiplexing - good news, bad news, and a plea for help
>
> I have brick multiplexing[1] functional to the point that it passes all
> basic AFR, EC, and quota tests. There are still some issues with tiering,
> and I wouldn't consider snapshots functional at all, but it seemed like a
> good point to see how well it works. I ran some *very simple* tests with
> 20 volumes, each 2x distribute on top of 2x replicate.
>
> First, the good news: it worked! Getting 80 bricks to come up in the same
> process, and then running I/O correctly across all of them, is pretty
> cool. Also, memory consumption is *way* down. RSS went from 1.1GB before
> (total across 80 processes) to about 400MB (one process) with
> multiplexing. Each process seems to consume approximately 8MB globally
> plus 5MB per brick, so (8+5)*80 = 1040 vs. 8+(5*80) = 408. Just
> considering the amount of memory, this means we could support about three
> times as many bricks as before. When memory *contention* is considered,
> the difference is likely to be even greater.
>
> Bad news: some of our code doesn't scale very well in terms of CPU use.
> To test performance I ran a test which creates 20,000 files across all 20
> volumes, then writes and deletes them, all using 100 client threads. This
> is similar to what smallfile does, but deliberately constructed to use a
> minimum of disk space - at any given time, at most one file per thread
> actually has 4KB worth of data in it. This allows me to run it against
> SSDs or even ramdisks, even with high brick counts, to factor out slow
> disks in a study of CPU/memory issues. Here are some results and
> observations.
>
> * On my first run, the multiplexed version of the test took 77% longer to
> run than the non-multiplexed version (5:42 vs. 3:13), and that was after
> I'd done some hacking to use 16 epoll threads. There's something a bit
> broken about setting that option normally: the value you set doesn't
> actually make it to the place that spawns the threads. Bumping this up
> further to 32 threads didn't seem to help.
>
> * A little profiling showed me that we're spending almost all of our time
> in pthread_spin_lock. I disabled the code that uses spinlocks instead of
> regular mutexes, which immediately improved performance and also reduced
> CPU time by almost 50%.

I have a feeling that a significant chunk of this locking comes from
accessing inode/fd contexts (the others being dictionaries and mem-pools,
as you pointed out). Most of this code uses the LOCK/UNLOCK macros, which
prefer spin_lock over mutex_lock (see the definition of gf_lock_setup,
which sets use_spinlocks to true if the number of active processors is
greater than 1). Also, some of the code (like posix_readv) acquires global
locks (like fd->inode->lock) when it doesn't need to. For example, there
is no reason two reads - especially reads of independent regions - should
contend, yet posix_readv/posix_writev take that lock to get the fd-ctx,
which serializes them. Maybe we should bring in read/write locks or RCU?

> * The next round of profiling showed that a lot of the locking is in
> mem-pool code, and a lot of that in turn is from dictionary code.
> Changing the dict code to use malloc/free instead of mem_get/mem_put gave
> another noticeable boost.
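
To make the mem-pool contention concrete, below is a rough, self-contained
sketch of the before/after shape of that change. This is not the actual
mem-pool or dict code - toy_pool, toy_mem_get, toy_mem_put and
toy_dict_alloc are made-up names - but it shows why funnelling every small
dict allocation through one pool lock hurts with ~100 threads, and why
plain calloc/free (which most allocators back with per-thread arenas)
relieves that pressure:

/*
 * Rough, self-contained sketch; not the actual mem-pool or dict code.
 * toy_pool/toy_mem_get/toy_mem_put stand in for a mem-pool: every
 * allocation and release goes through one mutex, so with ~100 threads
 * doing dict operations the pool lock itself becomes the hot spot.
 * toy_dict_alloc/toy_dict_free show the "after" shape: plain
 * calloc/free, leaving contention to the allocator's own arenas.
 */

#include <pthread.h>
#include <stdlib.h>

struct toy_pool {
        pthread_mutex_t  lock;
        void            *free_list;    /* chain of recycled objects */
        size_t           obj_size;     /* assumed >= sizeof(void *) */
};

/* Before: pooled allocation, serialized on pool->lock. */
void *
toy_mem_get(struct toy_pool *pool)
{
        void *obj;

        pthread_mutex_lock(&pool->lock);        /* contended under load */
        obj = pool->free_list;
        if (obj)
                pool->free_list = *(void **)obj;
        pthread_mutex_unlock(&pool->lock);

        if (!obj)
                obj = calloc(1, pool->obj_size);
        return obj;
}

void
toy_mem_put(struct toy_pool *pool, void *obj)
{
        pthread_mutex_lock(&pool->lock);        /* contended again on free */
        *(void **)obj = pool->free_list;
        pool->free_list = obj;
        pthread_mutex_unlock(&pool->lock);
}

/* After: let the general-purpose allocator handle it. */
void *
toy_dict_alloc(size_t size)
{
        return calloc(1, size);
}

void
toy_dict_free(void *obj)
{
        free(obj);
}

Of course, the pool exists to cut allocator traffic in the first place, so
dropping it everywhere is a trade-off; per-thread (or per-CPU) free lists
inside the pool would be another way to get the same effect while keeping
the pooling.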
> At this point run time was down to 4:50, which is 20% better than where I
> started but still far short of non-multiplexed performance. I can drive
> that down still further by converting more things to use malloc/free.
> There seems to be a significant opportunity here to improve performance -
> even without multiplexing - by taking a more careful look at our
> memory-management strategies:
>
> * Tune the mem-pool implementation to scale better with hundreds of
> threads.
>
> * Use mem-pools more selectively, or even abandon them altogether.
>
> * Try a different memory allocator such as jemalloc.
>
> I'd certainly appreciate some help/collaboration in studying these
> options further. It's a great opportunity to make a large impact on
> overall performance without a lot of code or specialized knowledge. Even
> so, I don't think memory management is our only internal scalability
> problem. There must be something else limiting parallelism, and quite
> severely at that. My first guess is io-threads, so I'll be looking into
> that first, but if anybody else has any ideas please let me know. There's
> no *good* reason why running many bricks in one process should be slower
> than running them in separate processes. If it remains slower, then the
> limit on the number of bricks and volumes we can support will remain
> unreasonably low. Also, the problems I'm seeing here probably don't
> *only* affect multiplexing. Excessive memory/CPU use and poor parallelism
> are issues we need to address anyway, so if anybody has any ideas please
> let me know.

I am interested in this and would like to improve concurrency where
possible.

> [1] http://review.gluster.org/#/c/14763/
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
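
For illustration, here is a rough sketch of the read/write-lock idea
mentioned above. It is not actual GlusterFS code - file_ctx_t and the
function names are made up - but it shows the shape: context lookups (the
common, read-mostly case in paths like posix_readv/posix_writev) take the
lock shared and no longer exclude each other, while only setting or
replacing the context takes it exclusively.

/*
 * Rough sketch only; not GlusterFS code.  file_ctx_t and these function
 * names are made up.  The idea: protect a read-mostly per-fd context
 * with a pthread rwlock so that concurrent readers (e.g. parallel reads
 * on the same fd) don't serialize on a single mutex/spinlock the way
 * they do on fd->inode->lock today.
 */

#include <pthread.h>
#include <stddef.h>

typedef struct {
        pthread_rwlock_t  lock;
        void             *ctx;          /* stands in for the stored fd-ctx */
} file_ctx_t;

int
file_ctx_init(file_ctx_t *f)
{
        f->ctx = NULL;
        return pthread_rwlock_init(&f->lock, NULL);
}

/* Read side: many lookups may proceed in parallel. */
void *
file_ctx_get(file_ctx_t *f)
{
        void *ctx;

        pthread_rwlock_rdlock(&f->lock);
        ctx = f->ctx;
        pthread_rwlock_unlock(&f->lock);

        return ctx;
}

/* Write side: only creating/replacing the context excludes readers. */
void
file_ctx_set(file_ctx_t *f, void *ctx)
{
        pthread_rwlock_wrlock(&f->lock);
        f->ctx = ctx;
        pthread_rwlock_unlock(&f->lock);
}

Whether rwlocks actually win depends on how short the critical sections
are; for very short ones, RCU or atomics might be needed to beat the
existing spinlocks.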