Re: Objections to multi-thread-epoll and proposal to use own-thread alternative

On 10/14/2014 08:23 AM, Jeff Darcy wrote:
We should try comparing the performance of multi-thread-epoll to
own-thread; it shouldn't be hard to hack own-thread into the
non-SSL-socket case.
Own-thread has always been available on non-SSL sockets, from the day it
was first implemented as part of HekaFS.

HOWEVER, if "own-thread" implies a thread per network connection, as
you scale out a Gluster volume with N bricks, you have O(N) clients,
and therefore you have O(N) threads on each glusterfsd (libgfapi
adoption would make it far worse)!  Suppose we are implementing a
64-brick configuration with 200 clients, not an unreasonably sized
Gluster volume for a scalable filesystem.  We then have 200 threads
per glusterfsd just listening for RPC messages on each brick.  On a
60-drive server there can be a lot more than 1 brick per server, so
multiply threads/glusterfsd by brick count!  It doesn't make sense to
have total threads >= CPUs, and modern processors make context
switching between threads more and more expensive.
It doesn't make sense to have total *busy* threads >= cores (not CPUs)
because of context switches, but idle threads are very low-cost.  Also,
note that multi-threaded epoll is also not free from context-switch
issues.  The real problem with either approach is "whale" servers with
large numbers of bricks apiece, vs. "piranha" servers with relatively
few.  That's an unbalanced system, with too little CPU and memory (and
probably disk/network bandwidth) relative to capacity.
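
Just to make the model concrete, here's a minimal own-thread sketch (an
illustration, not the actual HekaFS/Gluster code; all names are mine):
one blocking reader thread per accepted connection.  An idle connection
is a thread parked in read(), which costs a stack and a kernel task
entry, not CPU:

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* One of these per connection: blocks in read() while idle. */
static void *conn_reader(void *arg)
{
    int fd = (int)(long)arg;
    char buf[4096];
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* decode RPC record(s), hand off to a worker pool
         * (e.g. io-threads) */
    }
    close(fd);
    return NULL;
}

static void accept_loop(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;
        pthread_t t;
        if (pthread_create(&t, NULL, conn_reader,
                           (void *)(long)fd) == 0)
            pthread_detach(t);
        else
            close(fd);
    }
}

So the 200 threads per glusterfsd above are mostly 200 parked read()
calls; the cost is address space and scheduler bookkeeping, and it only
turns into context-switch pressure when the threads are actually busy.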

This is where we engineers come into play. Given a set of parameters, it's our job to build the system to suit our use case. If the "whale" server doesn't suit it, we shouldn't be building it. Conversely, if performance is not the issue but cost density is, we can build those whales and be happy with them. Just document the design well enough that we can make those decisions.

That said, I've already conceded that there are probably cases where
multi-threaded epoll will generate more parallelism than own-thread.
However, that only matters up to the point where we hit some other
bottleneck.  The question is whether the difference is apparent *to the
user* for any configuration and workload we can actually test.  Only
after we have that answer can we evaluate whether the benefit is greater
than the risk (of uncovering even more race conditions in other
components) and the drawback of being unable to support SSL.

Shyam mentioned a refinement to own-thread where we partition the set
of TCP connections evenly among a fixed pool of threads (own-thread is
the special case where every connection gets its own thread).
Some form of this would dovetail very nicely with the idea of
multiplexing multiple bricks onto a single glusterfsd process, which we
need to do for other reasons.
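
To make that refinement concrete, here's a rough sketch (NUM_LOOPS,
assign_conn, etc. are hypothetical names of mine, not proposed code) of
partitioning connections across a fixed pool of epoll loops, with each
connection pinned to exactly one loop so its state (including an SSL
object, if any) never migrates between threads:

#include <pthread.h>
#include <sys/epoll.h>

#define NUM_LOOPS 4     /* pool size; own-thread is the limiting case
                         * where every connection gets its own loop */

static int epfds[NUM_LOOPS];

/* Pin each new connection to one loop; it never moves afterward. */
static void assign_conn(int conn_fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };

    epoll_ctl(epfds[conn_fd % NUM_LOOPS], EPOLL_CTL_ADD, conn_fd, &ev);
}

/* One of these runs per pool thread. */
static void *loop_thread(void *arg)
{
    int epfd = epfds[(long)arg];
    struct epoll_event evs[64];

    for (;;) {
        int i, n = epoll_wait(epfd, evs, 64, -1);
        for (i = 0; i < n; i++) {
            /* read and dispatch evs[i].data.fd; no other thread
             * ever touches this connection's state */
        }
    }
    return NULL;
}

static void pool_init(void)
{
    for (long i = 0; i < NUM_LOOPS; i++) {
        pthread_t t;

        epfds[i] = epoll_create1(0);
        pthread_create(&t, NULL, loop_thread, (void *)i);
        pthread_detach(t);
    }
}

fd % NUM_LOOPS is the crudest possible placement, but anything that
keeps one connection on one thread preserves per-connection ordering
while still spreading the load across cores.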

On the Gluster server side, because of the io-threads translator, an
RPC listener thread is effectively just starting a worker thread and
then going back to read another RPC.  With own-thread, although RPC
requests are received in order, there is no guarantee that the
requests will be processed in the order that they were received from
the network.  On the client side, we have operations such as readdir
that will fan out parallel FOPS.  If you use the own-thread approach, then
these parallel FOP replies can all be processed in parallel by the
listener threads, so you get at least the same level of race condition
that you would get with multi-thread-epoll.
You get some race conditions, but not to the same level.  As you've
already pointed out yourself, multi-threaded epoll can generate greater
parallelism even among requests arriving on a single connection to a
single volume.  That is guaranteed to cause data-structure collisions
that would be impossible otherwise.  Also, let's not forget that either
change applies on the client side as well, in glusterd, in self-heal
and rebalance, etc.  Many of these have their own unique concerns with
respect to concurrency and reentrancy, and don't already have
io-threads.  For example, I've had to fix several bugs in this area that
were unique to glusterd.  At least we've begun to shake out some of
these issues with own-thread, though I'm sure there are plenty of bugs
still to be found.  With multi-threaded epoll we're going to have
even more issues in this area, and we've barely begun to discover them.
That's not a fatal problem, but it's definitely a CON.
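
For readers following along, the listener-to-worker handoff both of us
are describing looks roughly like this toy queue (an illustration, not
the io-threads translator itself).  Arrival order survives up to the
enqueue; past the dequeue, with several workers running, processing
order is whatever the scheduler gives you, and that's where the races
live:

#include <pthread.h>
#include <stdlib.h>

struct req {
    struct req *next;
    /* ... decoded RPC call ... */
};

static struct req *head, *tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Called by the listener (own-thread or epoll callback alike);
 * requests are queued in arrival order. */
static void enqueue(struct req *r)
{
    pthread_mutex_lock(&q_lock);
    r->next = NULL;
    if (tail)
        tail->next = r;
    else
        head = r;
    tail = r;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_lock);
}

/* N of these run concurrently, so processing (and reply) order is
 * not guaranteed past this point. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!head)
            pthread_cond_wait(&q_cond, &q_lock);
        struct req *r = head;
        head = r->next;
        if (!head)
            tail = NULL;
        pthread_mutex_unlock(&q_lock);
        /* handle r, send the reply, then release it */
        free(r);
    }
    return NULL;
}

The difference with multi-threaded epoll is that the races start before
the enqueue as well: two events for related state can be decoded on
different threads in the first place.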

  * CON: multi-epoll does not work with SSL.  It *can't* work with
  OpenSSL at all, short of adopting a hybrid model where SSL
  connections use own-thread while others use multi-epoll, which is a
  bit of a testing nightmare.
Why is it a testing nightmare?
It means having to test *both* sets of code paths, plus the code to hand
off between them or use them concurrently, in every environment - not
just those where we hand off to io-threads.

IMHO it's worth it to carefully trade off architectural purity
Where does this "architectural purity" idea come from?  This isn't about
architectural purity.  It's about code that's known to work vs. code
that might perform better *in theory* but also presents some new issues
we'd need to address.  I don't like thread-per-connection.  I've
recommended against it many times.  Whoever made the OpenSSL API so
unfriendly to other concurrency approaches was a fool.  Nonetheless,
that's the way the real world is, and *in this particular context* I
think own-thread has a better risk:reward ratio.
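
For anyone who hasn't had the pleasure: this is essentially all the
concurrency support OpenSSL (1.0.x as of this writing) gives you.  The
callbacks below only protect the library's own internal shared state;
serializing access to an individual SSL object remains entirely the
application's problem:

#include <pthread.h>
#include <stdlib.h>
#include <openssl/crypto.h>

static pthread_mutex_t *ssl_locks;

/* Protects OpenSSL's internal structures, nothing more. */
static void ssl_locking_cb(int mode, int n, const char *file, int line)
{
    (void)file;
    (void)line;
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&ssl_locks[n]);
    else
        pthread_mutex_unlock(&ssl_locks[n]);
}

static unsigned long ssl_id_cb(void)
{
    return (unsigned long)pthread_self();
}

static void ssl_threads_init(void)
{
    int i, n = CRYPTO_num_locks();

    ssl_locks = calloc(n, sizeof(*ssl_locks));
    for (i = 0; i < n; i++)
        pthread_mutex_init(&ssl_locks[i], NULL);
    CRYPTO_set_id_callback(ssl_id_cb);
    CRYPTO_set_locking_callback(ssl_locking_cb);
}

Nothing in that API lets two threads use the same SSL * concurrently,
so a model where whichever epoll thread wakes up first services a
connection's next event is out; own-thread (or a scheme that pins each
connection to one thread) fits naturally.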

In summary, to back the own-thread alternative I would need to see a)
that the own-thread approach is scalable, and b) performance data
showing that own-thread is comparable to multi-thread-epoll.
Gee, I wonder who we could get to run those tests.  Maybe that would be
better than mere conjecture (including mine).

Otherwise, in the absence of any other candidates, we have to go with
multi-thread-epoll.
*Only* on the basis of performance, ignoring the other issues we've
discussed?  I disagree.  If anything, there seem to be moves afoot to
de-emphasize the traditional NAS-replacement role in favor of more
archival/dispersed workloads.  I don't necessarily agree with that, but
it would make the "performance at any cost" argument even less relevant.

I believe this is an expected Red Hat view since the acquisition of Inktank. I'm not accusing anyone of taking a "company line"; I just expect there is going to be a shift of focus that will become apparent in bug reports, paid customer requirements, etc. Upstream, it may not even be recognized. This is, of course, all my personal opinion as an outside observer, and I could just be talking out of a posterior orifice.


P.S. I changed the subject line because I think it's inappropriate to
make this about person vs. person, taking my side or the opposition's.
There has been entirely too much divisive behavior on the list already.
Let's try to focus on the arguments themselves, not who's making them.
+11111111111111111 There are a lot of brilliant (and I don't just mean the colloquial European definition) people here who have all done amazing things, and who are often all correct under their expected parameters.



