On 3/25/19 7:38 AM, Marek Majkowski wrote:
> Hi,
>
> Recently we noticed epoll is not helpful for load balancing when
> called on a listen TCP socket. I described this in a blog post:
>
> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>
> The short explanation: new connections going to a listen socket are
> not evenly distributed across processes that wait on the EPOLLIN. In
> practice the last process doing epoll_wait() will get the new
> connection. See the trivial program to reproduce:
>
> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>
> $ ./epoll-and-accept.py &
> $ for i in `seq 6`; do echo | nc localhost 1024; done
> worker 0
> worker 0
> worker 0
> worker 0
> worker 0
> worker 0
>
> Worker #0 did all the accept() calls. This is because the listen
> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
> process calling epoll_wait() most recently will be woken up first.
> This usually is the busiest process. This leads to uneven load
> distribution across worker processes.
>
> Notice, described problem is different from what EPOLLEXCLUSIVE tries
> to solve. Exclusive flag is about waking up exactly one process, as
> opposed to default behaviour of waking up all the subscribers
> (thundering herd problem). Without EPOLLEXCLUSIVE the described
> load-balancing problem is less prominent, since there is an inherent
> race when all the woken up processes fight for the new connection. In
> such case the other workers have some chance of getting the new
> connection. The core problem still is there - accept calls are not
> well balanced across waiting processes.
>
> On a loaded server avoiding EPOLLEXCLUSIVE is wasteful. With high
> number of new connections, and dozens of worker processes, waking up
> everybody on every new connection is suboptimal.
>
> Notice, that multiple threads doing blocking accept() have a proper
> FIFO behaviour. In other words: you can achieve round-robin load
> balancing by having multiple workers hang on accept(), while you can't
> have that behaviour when waiting in epoll_wait().
>
> We are using EPOLLEXCLUSIVE, and as a solution to load-balancing
> problem we backported the EPOLLROUNDROBIN patch submitted by Jason
> Byron in 2015. We are running this patch for last 6 months, and it
> helped us to flatten the load across workers (and reduce tail
> latency).
>
> https://lists.openwall.net/linux-kernel/2015/02/17/723
>
> (PS. generally speaking EPOLLROUNDROBIN makes no sense in conjunction
> with SO_REUSEPORT sockets)
>
> Jason, would you mind to resubmit it?
>
> Cheers,
> Marek
>

Hi Marek,

So I think there may have been a couple issues last time. First, I
wasn't convinced if anybody actually wanted this. Sounds like there is
interest now. Second, was that it touched some of the core wakeup bits
and although I don't think any of the scheduler maintainers objected, I
didn't want to change the core wakeup code for this epoll only feature.

I think I can probably register a generic wakeup with the core code and
then have only the epoll code be aware of the round robin behavior. So I
will cook up a patch like that and re-post.

Thanks,

-Jason
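
For readers following the thread, the LIFO versus FIFO behaviour described above can be illustrated with a small toy model. This is only a sketch and not kernel code: the function name `simulate` and its parameters are made up for illustration, and it assumes that exactly one waiter is woken per connection (as with EPOLLEXCLUSIVE) and that each worker finishes and goes back to waiting before the next connection arrives (as in the `nc` loop above).

```python
from collections import Counter, deque


def simulate(n_workers, n_connections, lifo):
    """Toy model of an exclusive-wakeup wait queue (hypothetical helper).

    The waiter at the head of the queue is the one woken for each new
    connection. After handling it, the worker re-enters the queue either
    at the head (LIFO, like the epoll_wait() case in the thread) or at
    the tail (FIFO, like blocking accept()).
    """
    waiters = deque(range(n_workers))  # worker 0 starts at the head
    accepted = Counter()
    for _ in range(n_connections):
        worker = waiters.popleft()  # only the head waiter is woken
        accepted[worker] += 1
        if lifo:
            waiters.appendleft(worker)  # most recent waiter is woken next
        else:
            waiters.append(worker)  # goes to the back: round-robin
    return [accepted[w] for w in range(n_workers)]


if __name__ == "__main__":
    print("LIFO:", simulate(4, 6, lifo=True))
    print("FIFO:", simulate(3, 6, lifo=False))
```

Under these assumptions the LIFO variant hands every connection to worker 0, matching the reproducer output quoted above, while the FIFO variant spreads them evenly. A real server under load will not be this clean, since workers re-enter the queue at different times.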