On Fri, Jul 8, 2016 at 8:02 PM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
> In either of these situations, one glusterfsd process on whatever peer the
> client is currently talking to will skyrocket to *nproc* cpu usage (800%,
> 1600%) and the storage cluster is essentially useless; all other clients
> will eventually try to read or write data to the overloaded peer and, when
> that happens, their connection will hang. Heals between peers hang because
> the load on the peer is around 1.5x the number of cores or more. This occurs
> in either gluster 3.6 or 3.7, is very repeatable, and happens much too
> frequently.
I have some good news and some bad news.
The good news is that features to address this are already planned for the
4.0 release. Primarily I'm referring to QoS enhancements, some parts of
which were already implemented for the bitrot daemon. I'm still working
out the exact requirements for this as a general facility, though. You
can help! :) Also, some of the work on "brick multiplexing" (multiple
bricks within one glusterfsd process) should help to prevent the thrashing
that causes a complete freeze-up.
Now for the bad news. Did I mention that these are 4.0 features? 4.0 is
not near term, and not getting any nearer as other features and releases
keep "jumping the queue" to absorb all of the resources we need for 4.0
to happen. Not that I'm bitter or anything. ;) To address your more
immediate concerns, I think we need to consider more modest changes that
can be completed in more modest time. For example:
* The load should *never* get to 1.5x the number of cores. Perhaps we
could tweak the thread-scaling code in io-threads and epoll to check
system load and not scale up (or even scale down) if system load is
already high.
* We might be able to tweak io-threads (which already runs on the
bricks and already has a global queue) to schedule requests in a
fairer way across clients. Right now it executes them in the
same order that they were read from the network. That tends to
be a bit "unfair" and that should be fixed in the network code,
but that's a much harder task.

This sounds like an easier fix. We can make io-threads factor in one more
input before scheduling, i.e., the client through which the request came in
(essentially frame->root->client). That should at least make the problem
bearable rather than crippling. As for which algorithm to use, I think we can
consider the leaky bucket from the bit-rot implementation, or dmclock. I
haven't thought about the algorithm part in any depth yet. If the approach
sounds OK, we can discuss algorithms further.
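
To make the per-client idea concrete, here is a minimal sketch in Python. It is
not GlusterFS code: the FairQueue class and its enqueue/dequeue methods are
hypothetical names, and plain round-robin stands in for whichever algorithm
(leaky bucket, dmclock) is eventually chosen. The only point is that requests
are keyed by client before a worker picks the next one, rather than drained
from a single queue in arrival order.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Hypothetical per-client queue: round-robin across clients
    instead of one global FIFO."""

    def __init__(self):
        # client id -> that client's pending requests, in arrival order.
        # The dict order is the rotation order.
        self._queues = OrderedDict()

    def enqueue(self, client, request):
        # A client with no pending work joins the back of the rotation.
        self._queues.setdefault(client, deque()).append(request)

    def dequeue(self):
        """Return (client, request) for the next client in rotation,
        or None if nothing is pending."""
        if not self._queues:
            return None
        client, pending = self._queues.popitem(last=False)
        request = pending.popleft()
        if pending:
            # Client still has work: move it to the back of the rotation.
            self._queues[client] = pending
        return client, request


if __name__ == "__main__":
    q = FairQueue()
    for i in range(4):
        q.enqueue("noisy", "write-%d" % i)
    q.enqueue("quiet", "read-0")

    item = q.dequeue()
    while item is not None:
        print(item)
        item = q.dequeue()
```

With one global FIFO the "quiet" client's read would wait behind all four
writes from "noisy"; here it is served second. A weighted or rate-limited
variant would change only how dequeue picks the next client.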
These are only weak approximations of what we really should be doing,
and will be doing in the long term, but (without making any promises)
they might be sufficient and achievable in the near term. Thoughts?
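
As a rough illustration of the first suggestion (do not scale up, or even
scale down, when system load is already high), a sketch in Python follows. It
is not the actual io-threads or epoll scaling code. The function name
target_threads, its parameters, the one-thread step, and the "load >= number
of cores" threshold are all assumptions chosen for the example.

```python
import os


def target_threads(current, queued, min_threads=1, max_threads=16,
                   load=None, cores=None):
    """Hypothetical policy: pick the next worker-thread count for a pool.

    current -- threads running now
    queued  -- requests waiting in the queue
    load    -- 1-minute load average (read from the OS if not given)
    cores   -- number of CPUs (read from the OS if not given)
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if load is None:
        load = os.getloadavg()[0]  # Unix only

    if load >= cores:
        # Machine is already saturated: shed a thread instead of adding one.
        return max(min_threads, current - 1)
    if queued > current:
        # There is a backlog and CPU headroom: grow by one thread.
        return min(max_threads, current + 1)
    return current


if __name__ == "__main__":
    # 8 cores, load of 12 (1.5x cores): shrink even though work is queued.
    print(target_threads(current=4, queued=50, load=12.0, cores=8))
    # Same backlog on a lightly loaded box: grow.
    print(target_threads(current=4, queued=50, load=2.0, cores=8))
```

The real thresholds would need tuning, and load average is a coarse signal
that also counts processes blocked on I/O, so this only shows the shape of
the check, not a recommended policy.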
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
Raghavendra G