Re: Resurrecting EPOLLROUNDROBIN

On 3/25/19 8:23 PM, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 4:38 AM Marek Majkowski <marek@xxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> Recently we noticed epoll is not helpful for load balancing when
>> called on a listen TCP socket. I described this in a blog post:
>>
>> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>>
>> The short explanation: new connections going to a listen socket are
>> not evenly distributed across processes that wait on the EPOLLIN. In
>> practice the last process doing epoll_wait() will get the new
>> connection. See the trivial program to reproduce:
>>
>> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>>
>>    $ ./epoll-and-accept.py &
>>    $ for i in `seq 6`; do echo | nc localhost 1024; done
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>    worker 0
>>
>> Worker #0 did all the accept() calls. This is because the listen
>> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
>> process calling epoll_wait() most recently will be woken up first.
>> This usually is the busiest process. This leads to uneven load
>> distribution across worker processes.
> 
> I recall a discussion of this at a conference several years ago, but
> it's been several years.  Anyway:
> 
> I read the blog post, and I looked at your example, and the kernel
> behavior actually seems quite sane to me.  From the kernel's
> perspective, if you're calling accept in a loop in a bunch of threads
> (mediated by epoll or otherwise), and one of those threads is able to
> call accept() fast enough, then that thread *should* get all the
> sockets.  It's cache hot, and bouncing around is expensive.

Yes, the EPOLLEXCLUSIVE flag was what we ended up with after the last set
of discussions on this. It's meant as a sane wakeup behavior for the case
where one event source fd is attached to multiple epoll fds (epfds).
Without the EPOLLEXCLUSIVE flag, all of the epfds get woken up.

> 
> Now obviously the overall behavior here is suboptimal, but that's
> arguably because the user process is being silly, not because the
> kernel is doing it wrong.  Shouldn't the user process take the newly
> accepted socket and hand it off to an appropriate thread for
> servicing?  If I were doing this, I'd get a freshly accepted socket
> and either forward it to a thread (or process) that is appropriately
> lightly loaded or, even better, that is pinned to the CPU that RFS has
> assigned to the flow assuming that that thread isn't overloaded.  If
> the program is using threads, then this doesn't need to involve the
> kernel at all and, if it's using processes, then SCM_RIGHTS would do
> the trick.  But asking the kernel to arbitrarily and awkwardly
> round-robin the sockets and then keeping the flows on the threads that
> get picked means that, at best, each thread gets an arbitrary
> selection of flows and the balancing isn't particularly good.
> 
> Now, if someone were to actually try doing this in userspace and it
> was too slow, I could see adding some kernel mechanisms to accelerate
> the process.  Perhaps a mechanism to ask to accept only new
> connections that are RFSified to the calling CPU would be useful.  But
> this shouldn't be an *epoll* mechanism, since there is no actual
> guarantee that the CPU that returns first from epoll_wait() is the
> same CPU that calls accept() under load.  (Under load, multiple new
> connections could come in and wake multiple CPUs before any of them
> manage to call accept().)
> 
> So I think that EPOLLROUNDROBIN is not a great solution to the
> problem, and I think that the problem isn't obviously a *kernel*
> problem in the first place.
> 

Another point in this direction: yes, you can try to 'balance' things at
accept() time, but over time things can still get very unbalanced.
Long-lived connections, for example, could end up on some CPUs and not
others. So it seems like some sort of periodic load balancing would be
necessary anyway.
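
As a rough illustration of what a periodic pass could look like in userspace,
the sketch below plans single-connection moves from the busiest worker to the
idlest one until the loads are within a tolerance. The function name
plan_rebalance, the tolerance parameter, and the use of connection count as
the load metric are all assumptions for illustration. How a connection is
actually migrated (for example via SCM_RIGHTS between processes) is left out.

```python
def plan_rebalance(conns_per_worker, tolerance=1):
    """Plan moves that even out per-worker connection counts.

    Returns (moves, final_loads), where each move is a (src, dst) pair of
    worker indexes and stands for migrating one connection.
    """
    if tolerance < 1:
        # With tolerance 0 a single spare connection would bounce forever.
        raise ValueError("tolerance must be at least 1")

    loads = list(conns_per_worker)
    moves = []
    while loads:
        src = max(range(len(loads)), key=loads.__getitem__)  # busiest worker
        dst = min(range(len(loads)), key=loads.__getitem__)  # idlest worker
        if loads[src] - loads[dst] <= tolerance:
            break
        loads[src] -= 1
        loads[dst] += 1
        moves.append((src, dst))
    return moves, loads


if __name__ == "__main__":
    # The situation from the reproducer: worker 0 holds all six connections.
    moves, final = plan_rebalance([6, 0, 0])
    print(moves)
    print(final)
```

A real balancer would probably weigh connections by traffic or CPU time
rather than just counting them, since long-lived connections are not all
equally expensive.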

That said, I'm not against some basic wakeup distribution strategies in
the kernel, such as round robin, especially if they can be entirely
contained in the epoll layer (which I think we can do). But clearly we
don't want to introduce a new wakeup distribution strategy for every
use case.
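
For reference, the effect of the two wakeup orders can be shown with a toy
model of the wait queue. This is not kernel code: the simulate function and
the policy names are hypothetical, and the model assumes each worker finishes
its accept() and is waiting again before the next connection arrives, as in
the nc loop from the reproducer.

```python
from collections import Counter, deque


def simulate(policy, workers=3, connections=6):
    """Count connections handled per worker under a wakeup policy."""
    queue = deque(range(workers))  # index 0 is the head of the wait queue
    handled = Counter()
    for _ in range(connections):
        worker = queue.popleft()  # the waiter at the head is woken
        handled[worker] += 1      # it accepts, then waits again
        if policy == "lifo":
            queue.appendleft(worker)  # re-queued at the head
        elif policy == "round_robin":
            queue.append(worker)      # re-queued at the tail
        else:
            raise ValueError("unknown policy: " + policy)
    return dict(handled)


if __name__ == "__main__":
    print(simulate("lifo"))         # {0: 6}
    print(simulate("round_robin"))  # {0: 2, 1: 2, 2: 2}
```

The LIFO case reproduces the "worker 0" output quoted above. Moving the woken
waiter to the tail is the only change needed to spread the same six
connections evenly in this model.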

Thanks,

-Jason



