On 02/18/2015 11:33 AM, Ingo Molnar wrote:
> * Jason Baron <jbaron@xxxxxxxxxx> wrote:
>
>>> This has two main advantages: firstly it solves the
>>> O(N) (micro-)problem, but it also more evenly
>>> distributes events both between task-lists and within
>>> epoll groups as tasks as well.
>>
>> It's solving 2 issues - spurious wakeups, and more even
>> loading of threads. The event distribution is more even
>> between 'epoll groups' with this patch, however, if
>> multiple threads are blocking on a single 'epoll group',
>> this patch does not affect the event distribution
>> there. [...]
>
> Regarding your last point, are you sure about that?
>
> If we have say 16 epoll threads registered, and if the list
> is static (no register/unregister activity), then the
> wakeup pattern is in strict order of the list: threads
> closer to the list head will be woken more frequently, in a
> wake-once fashion. So if threads do just quick work and go
> back to sleep quickly, then typically only the first 2-3
> threads will get any runtime in practice - the wakeup
> iteration never gets 'deep' into the list.
>
> With the round-robin shuffling of the list, the threads get
> shuffled to the tail on wakeup, which distributes events
> evenly: all 16 epoll threads will accumulate an even
> distribution of runtime, statistically.
>
> Have I misunderstood this somehow?
>

In the case of multiple threads per epoll set, we currently add to the
head of the wakeup queue exclusively in 'epoll_wait()', and then remove
from the queue once 'epoll_wait()' returns. So I don't think this patch
addresses balancing on a per-epoll-set basis.

I think we could address the case you describe by simply doing
__add_wait_queue_tail_exclusive() instead of __add_wait_queue_exclusive()
in epoll_wait() (see the sketch below). However, I think the resulting
userspace API change is less clear, since epoll_wait() doesn't currently
take an 'input' events argument the way epoll_ctl() does.

Thanks,

-Jason
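
For concreteness, here is a minimal sketch of the queueing change in
ep_poll() (the kernel side of epoll_wait()). The surrounding context is
approximated from memory of fs/eventpoll.c around this era, so treat it
as illustrative rather than an actual patch:

	/*
	 * ep_poll() in fs/eventpoll.c -- illustrative excerpt only.
	 * The blocked thread is queued on ep->wq before sleeping and
	 * removed again before epoll_wait() returns.
	 */
	if (!ep_events_available(ep)) {
		init_waitqueue_entry(&wait, current);

		/*
		 * Current behavior: exclusive waiter added at the head
		 * of ep->wq, so the most recently blocked thread is
		 * woken first (LIFO), and a few "hot" threads tend to
		 * absorb most of the wakeups.
		 */
		__add_wait_queue_exclusive(&ep->wq, &wait);

		/*
		 * Proposed: queue at the tail instead, so successive
		 * wake-one wakeups rotate through the blocked threads
		 * (FIFO):
		 *
		 *	__add_wait_queue_tail_exclusive(&ep->wq, &wait);
		 */
		...
	}

Both helpers set WQ_FLAG_EXCLUSIVE; the only behavioral difference is
head vs. tail insertion into ep->wq, so ep_poll_callback() still wakes a
single thread per event, just in FIFO rather than LIFO order.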