Jason Baron <jbaron@xxxxxxxxxx> wrote:
> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
> > * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> >
> >>> [...] However, I think the userspace API change is less
> >>> clear since epoll_wait() doesn't currently have an
> >>> 'input' events argument as epoll_ctl() does.
> >> ... but the change would be a bit clearer and somewhat
> >> more flexible: LIFO or FIFO queueing, right?
> >>
> >> But having the queueing model as part of the epoll
> >> context is a legitimate approach as well.
> > Btw., there's another optimization that the networking code
> > already does when processing incoming packets: waking up a
> > thread on the local CPU, where the wakeup is running.
> >
> > Doing the same on epoll would have real scalability
> > advantages where incoming events are IRQ driven and are
> > distributed amongst multiple CPUs.
> >
> > Where events are task driven the scheduler will already try
> > to pair up waker and wakee so it might not show up in
> > measurements that markedly.
> >
>
> Right, so this makes me think that we may want to potentially
> support a variety of wakeup policies. Adding these to the
> generic wake up code is just going to be too messy. So, perhaps
> a better approach here would be to register a single
> wait_queue_t with the event source queue that will always
> be woken up, and then layer any epoll balancing/irq affinity
> policies on top of that. So in essence we end up with sort of
> two queue layers, but I think it provides much nicer isolation
> between layers. Also, the bulk of the changes are going to be
> isolated to the epoll code, and we avoid Andy's concern about
> missing, or starving out wakeups.
>
> So here's a stab at how this API could look:
>
> 1. ep1 = epoll_create1(EPOLL_POLICY);
>
> So EPOLL_POLICY here could be the round robin policy described
> here, or the irq affinity or other ideas.
> The idea is to create
> an fd that is local to the process, such that other processes
> can not subsequently attach to it and affect our policy.

I'm not against defining more policies if needed.
Maybe FIFO vs LIFO is a good case for this.

For affinity, it could probably be done transparently based on
epoll_wait retrievals + EPOLL_CTL_MOD operations.

> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
>
> This associates ep1 with the event source. ep1 can be
> associated with or added to at most 1 wakeup source. This call
> would largely just form the association, but not queue anything
> to the fd_source wait queue.

This would mean one extra FD for every fd_source, but that's
only a handful of FDs (listen sockets), correct?

> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>    epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>    epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>    .
>    .
>    .
>
> Finally, we add the epoll sets to the event source (indirectly via
> ep1). So the first add would actually queue the callback to the
> fd_source. While the subsequent calls would simply queue things
> to the 'nested' wakeup queue associated with ep1.

I'm not sure I follow; wouldn't this increase the number of wakeups?

> So any existing epoll/poll/select calls could be queued as well
> to fd_source and will operate independently from this mechanism,
> as the fd_source queue continues to be 'wake all'. Also, there
> should be no changes necessary to __wake_up_common(), other
> than potentially passing more back through the
> wait_queue_func_t, such as 'nr_exclusive'.
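
For reference, here is a rough userspace sketch of steps 1-3 using
only calls that exist today. It is not the proposed API: EPOLL_POLICY
is not a real flag, so this assumes plain epoll_create1(0), and
EPOLL_CTL_ADD currently needs a non-NULL event, so step 2 passes one
instead of NULL. The helper names (add_in, build_nested, count_ready)
are made up for illustration. It only shows what the nesting looks
like with the current 'wake all' behaviour.

```c
#include <sys/epoll.h>

/* Register fd for EPOLLIN on epoll set ep (hypothetical helper). */
static int add_in(int ep, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = fd;
	return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Build ep1 around fd_source, then nest ep1 inside n outer epoll sets.
 * Returns 0 on success, -1 on error (fds are leaked on error to keep
 * the sketch short).
 */
int build_nested(int fd_source, int *ep1, int eps[], int n)
{
	*ep1 = epoll_create1(0);	/* step 1, without EPOLL_POLICY */
	if (*ep1 < 0)
		return -1;
	if (add_in(*ep1, fd_source) < 0)	/* step 2, non-NULL event */
		return -1;

	for (int i = 0; i < n; i++) {	/* step 3: ep2, ep3, ep4, ... */
		eps[i] = epoll_create1(0);
		if (eps[i] < 0 || add_in(eps[i], *ep1) < 0)
			return -1;
	}
	return 0;
}

/* Count how many outer epoll sets report ep1 as readable right now. */
int count_ready(const int eps[], int n)
{
	int ready = 0;

	for (int i = 0; i < n; i++) {
		struct epoll_event ev;

		/* timeout 0: poll without blocking */
		if (epoll_wait(eps[i], &ev, 1, 0) == 1)
			ready++;
	}
	return ready;
}
```

With the read end of a pipe as fd_source and n = 3, count_ready()
should report 0 while the pipe is empty and 3 once a byte is written,
since everything here is level-triggered and every outer set sees the
event. Any round-robin or affinity policy would have to change that
last part.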