Re: Resurrecting EPOLLROUNDROBIN

Jason Baron <jbaron@xxxxxxxxxx> · Mon, 25 Mar 2019 12:54:40 -0400

On 3/25/19 7:38 AM, Marek Majkowski wrote:
> Hi,
> 
> Recently we noticed that epoll does not help with load balancing when
> used on a listening TCP socket. I described this in a blog post:
> 
> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
> 
> The short explanation: new connections arriving on a listen socket are
> not evenly distributed across the processes waiting for EPOLLIN on it.
> In practice the process that called epoll_wait() most recently gets
> the new connection. Here is a trivial program to reproduce this:
> 
> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
> 
>    $ ./epoll-and-accept.py &
>    $ for i in `seq 6`; do echo | nc localhost 1024; done
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
>    worker 0
> 
> Worker #0 did all the accept() calls. This is because the listen
> socket wait queue is a LIFO (not FIFO!). With the current behaviour,
> the process that called epoll_wait() most recently is woken up first.
> This is usually the busiest process, which leads to uneven load
> distribution across worker processes.
> 
> Note that the described problem is different from the one
> EPOLLEXCLUSIVE tries to solve. The exclusive flag is about waking up
> exactly one process, as opposed to the default behaviour of waking up
> all the subscribers (the thundering herd problem). Without
> EPOLLEXCLUSIVE the described load-balancing problem is less prominent,
> since there is an inherent race when all the woken-up processes fight
> for the new connection. In that case the other workers have some
> chance of getting the new connection. The core problem is still
> there: accept() calls are not well balanced across waiting processes.
> 
> On a loaded server, avoiding EPOLLEXCLUSIVE is wasteful. With a high
> number of new connections and dozens of worker processes, waking up
> every worker on every new connection is suboptimal.
> 
> Note that multiple threads doing blocking accept() get proper FIFO
> behaviour. In other words: you can achieve round-robin load
> balancing by having multiple workers block in accept(), but you can't
> get that behaviour when waiting in epoll_wait().
> 
> We are using EPOLLEXCLUSIVE, and as a solution to the load-balancing
> problem we backported the EPOLLROUNDROBIN patch submitted by Jason
> Baron in 2015. We have been running this patch for the last 6 months,
> and it has helped us flatten the load across workers (and reduce tail
> latency).
> 
> https://lists.openwall.net/linux-kernel/2015/02/17/723
> 
> (PS. generally speaking EPOLLROUNDROBIN makes no sense in conjunction
> with SO_REUSEPORT sockets)
> 
> Jason, would you mind resubmitting it?
> 
> Cheers,
>     Marek
> 

Hi Marek,

So I think there may have been a couple of issues last time. First, I
wasn't convinced that anybody actually wanted this. Sounds like there is
interest now. Second, it touched some of the core wakeup bits, and
although I don't think any of the scheduler maintainers objected, I
didn't want to change the core wakeup code for this epoll-only feature.
I think I can probably register a generic wakeup with the core code and
then have only the epoll code be aware of the round-robin behavior. So I
will cook up a patch like that and re-post.

Thanks,

-Jason
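
For illustration only, here is a rough user-space model of the wakeup
ordering discussed in this thread. It is not kernel code and not the
EPOLLROUNDROBIN patch. The function name simulate(), the policy names
"lifo" and "roundrobin", and the assumption that every worker is idle
and handles its connection instantly are all made up for this sketch.
It only shows why re-queueing the woken waiter at the head starves the
other workers, while moving it to the tail spreads the load.

```python
from collections import Counter, deque


def simulate(policy, workers=4, connections=12):
    """Model exclusive wakeups on one shared listen socket.

    Assumptions (hypothetical, for illustration): all workers are idle,
    exactly one waiter is woken per connection, and the woken worker
    handles the connection immediately and calls epoll_wait() again.
    """
    # The left end of the deque is the head of the wait queue.
    # New waiters are added at the head, so the worker that entered
    # epoll_wait() last (worker `workers - 1`) ends up at the head.
    queue = deque()
    for w in range(workers):
        queue.appendleft(w)

    handled = Counter()
    for _ in range(connections):
        worker = queue.popleft()          # wake the waiter at the head
        handled[worker] += 1
        if policy == "lifo":
            queue.appendleft(worker)      # re-enters at the head again
        elif policy == "roundrobin":
            queue.append(worker)          # moved to the tail after wakeup
        else:
            raise ValueError(f"unknown policy: {policy!r}")
    return dict(handled)


if __name__ == "__main__":
    print("lifo:      ", simulate("lifo"))
    print("roundrobin:", simulate("roundrobin"))
```

With the defaults, the "lifo" policy hands all 12 connections to the
worker that started waiting last, while "roundrobin" gives each of the
4 workers 3 connections.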