On Feb 18, 2015 9:38 AM, "Jason Baron" <jbaron@xxxxxxxxxx> wrote:
>
> On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> > * Jason Baron <jbaron@xxxxxxxxxx> wrote:
> >
> >>> This has two main advantages: firstly it solves the
> >>> O(N) (micro-)problem, but it also more evenly
> >>> distributes events both between task-lists and within
> >>> epoll groups as tasks as well.
> >> It's solving 2 issues - spurious wakeups, and more even
> >> loading of threads. The event distribution is more even
> >> between 'epoll groups' with this patch, however, if
> >> multiple threads are blocking on a single 'epoll group',
> >> this patch does not affect the event distribution
> >> there. [...]
> > Regarding your last point, are you sure about that?
> >
> > If we have say 16 epoll threads registered, and if the list
> > is static (no register/unregister activity), then the
> > wakeup pattern is in strict order of the list: threads
> > closer to the list head will be woken more frequently, in a
> > wake-once fashion. So if threads do just quick work and go
> > back to sleep quickly, then typically only the first 2-3
> > threads will get any runtime in practice - the wakeup
> > iteration never gets 'deep' into the list.
> >
> > With the round-robin shuffling of the list, the threads get
> > shuffled to the tail on wakeup, which distributes events
> > evenly: all 16 epoll threads will accumulate an even
> > distribution of runtime, statistically.
> >
> > Have I misunderstood this somehow?
> >
> >
>
> So in the case of multiple threads per epoll set, we currently
> add to the head of wakeup queue exclusively in 'epoll_wait()',
> and then subsequently remove from the queue once
> 'epoll_wait()' returns. So I don't think this patch addresses
> balancing on a per epoll set basis.
>
> I think we could address the case you describe by simply doing
> __add_wait_queue_tail_exclusive() instead of
> __add_wait_queue_exclusive() in epoll_wait().
> However, I think
> the userspace API change is less clear since epoll_wait() doesn't
> currently have an 'input' events argument as epoll_ctl() does.

FWIW there's currently discussion about adding a new epoll API for
batch epoll_ctl. It could be worth coordinating with that effort if
some variant could address both use cases.

I'm still nervous about changing the per-fd wakeup stuff to do
anything other than waking everything. After all, epoll and poll can
be used concurrently.

What about a slightly different approach: could an epoll fd support
multiple contexts? For example, an fd could be set (with epoll_ctl or
the new batch stuff) to wake any epoll waiter, one specific epoll
waiter, an epoll waiter preferably on the waking cpu, etc.

This would have the benefit of keeping the wakeup changes localized to
the epoll code.

--Andy

>
> Thanks,
>
> -Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
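
For anyone who wants to see the head-vs-tail effect discussed above, here
is a rough userspace toy model. It is not kernel code and not taken from
the patch: simulate(), busy_events and the single wake-one queue are all
made up for illustration. It assumes one event per tick, that only the
waiter at the head is woken, and that a woken thread re-enters
epoll_wait() a fixed number of events later. Head insertion stands in for
__add_wait_queue_exclusive(), tail insertion for
__add_wait_queue_tail_exclusive().

```python
from collections import deque


def simulate(n_threads, n_events, add_to_tail, busy_events=2):
    """Toy model: return per-thread wakeup counts for a wake-one queue.

    busy_events is a hypothetical knob: how many further events arrive
    before a woken thread calls epoll_wait() again.
    """
    queue = deque(range(n_threads))  # queue[0] is the list head
    busy = deque()                   # (ready_at_event, thread), in FIFO order
    wakeups = [0] * n_threads

    for event in range(n_events):
        # Threads that finished their work go back to waiting.
        while busy and busy[0][0] <= event:
            _, tid = busy.popleft()
            if add_to_tail:
                queue.append(tid)      # models tail insertion
            else:
                queue.appendleft(tid)  # models head insertion
        if not queue:
            continue  # nobody is waiting; the toy model drops the event
        tid = queue.popleft()          # exclusive wakeup: head waiter only
        wakeups[tid] += 1
        busy.append((event + 1 + busy_events, tid))

    return wakeups


if __name__ == "__main__":
    print("head:", simulate(16, 1600, add_to_tail=False))
    print("tail:", simulate(16, 1600, add_to_tail=True))
```

Under these assumptions, head insertion keeps handing events to the first
few threads while the rest never run, and tail insertion degenerates into
plain round-robin. Real workloads with variable work times and concurrent
poll() users will of course look messier.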