[CC += linux-api@xxxxxxxxxxxxxxx]

Jason,

Since this is a kernel-user-space API change, please CC linux-api@.
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed
to linux-api@xxxxxxxxxxxxxxx, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

Thanks,

Michael

On Mon, Feb 9, 2015 at 9:05 PM, Jason Baron <jbaron@xxxxxxxxxx> wrote:
> Hi,
>
> When we are sharing a wakeup source among multiple epoll fds, we end up with
> thundering herd wakeups, since there is currently no way to add to the
> wakeup source exclusively. This series introduces 2 new epoll flags:
> EPOLLEXCLUSIVE, for adding to a wakeup source exclusively, and
> EPOLLROUNDROBIN, which is to be used in conjunction with EPOLLEXCLUSIVE to
> evenly distribute the wakeups. I'm showing perf results from the simple
> pipe() use case below, but this patch was originally motivated by a desire
> to improve wakeup balance and cpu usage for a shared listen socket.
>
> Perf stat, 3.19.0-rc7+, 4 core, Intel(R) Xeon(R) CPU E3-1265L v3 @ 2.50GHz:
>
> pipe test wake all:
>
>  Performance counter stats for './wake':
>
>       10837.480396      task-clock (msec)         #    1.879 CPUs utilized
>            2047108      context-switches          #    0.189 M/sec
>             214491      cpu-migrations            #    0.020 M/sec
>                247      page-faults               #    0.023 K/sec
>        23655687888      cycles                    #    2.183 GHz
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>        11242141621      instructions              #    0.48  insns per cycle
>         2313479486      branches                  #  213.470 M/sec
>           13679036      branch-misses             #    0.59% of all branches
>
>        5.768295821 seconds time elapsed
>
> pipe test wake balanced:
>
>  Performance counter stats for './wake -o':
>
>         291.250312      task-clock (msec)         #    0.094 CPUs utilized
>              40308      context-switches          #    0.138 M/sec
>               1448      cpu-migrations            #    0.005 M/sec
>                248      page-faults               #    0.852 K/sec
>          646407197      cycles                    #    2.219 GHz
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>          364256883      instructions              #    0.56  insns per cycle
>           65775397      branches                  #  225.838 M/sec
>             535637      branch-misses             #    0.81% of all branches
>
>        3.086694452 seconds time elapsed
>
> Rough epoll manpage text:
>
> EPOLLEXCLUSIVE
>     Provides exclusive wakeups when attaching multiple epoll fds to a
>     shared wakeup source. Must be specified on an EPOLL_CTL_ADD operation.
>
> EPOLLROUNDROBIN
>     Provides balancing for exclusive wakeups when attaching multiple epoll
>     fds to a shared wakeup source. Must be specified with EPOLLEXCLUSIVE
>     during an EPOLL_CTL_ADD operation.
>
>
> Thanks,
>
> -Jason
>
> #include <unistd.h>
> #include <sys/epoll.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <pthread.h>
>
> #define NUM_THREADS 100
> #define NUM_EVENTS 20000
> /* Guard against a libc header that already defines the flag. */
> #ifndef EPOLLEXCLUSIVE
> #define EPOLLEXCLUSIVE (1 << 28)
> #endif
> /* The round-robin flag, called EPOLLROUNDROBIN in the text above. */
> #define EPOLLBALANCED (1 << 27)
>
> int optimize, exclusive;
> int p[2];
> pthread_t threads[NUM_THREADS];
> int event_count[NUM_THREADS];
>
> void die(const char *msg) {
> 	perror(msg);
> 	exit(-1);
> }
>
> void *run_func(void *ptr)
> {
> 	int ret;
> 	int epfd;
> 	char buf[4];
> 	int id = *(int *)ptr;
> 	/* Per-thread event, so threads do not race on a shared struct. */
> 	struct epoll_event evt = { .events = EPOLLIN };
>
> 	if ((epfd = epoll_create(1)) < 0)
> 		die("create");
>
> 	if (optimize)
> 		evt.events |= EPOLLBALANCED | EPOLLEXCLUSIVE;
> 	else if (exclusive)
> 		evt.events |= EPOLLEXCLUSIVE;
> 	ret = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &evt);
> 	if (ret)
> 		perror("epoll_ctl add error");
>
> 	while (1) {
> 		/* evt holds exactly one event, so ask for at most one. */
> 		ret = epoll_wait(epfd, &evt, 1, -1);
> 		ret = read(p[0], buf, sizeof(int));
> 		if (ret == 4)
> 			event_count[id]++;
> 	}
> 	return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> 	int i, j;
> 	int id[NUM_THREADS];
> 	int total = 0;
> 	int nohit = 0;
>
> 	if (argc == 2) {
> 		if (strcmp(argv[1], "-o") == 0)
> 			optimize = 1;
> 		if (strcmp(argv[1], "-e") == 0)
> 			exclusive = 1;
> 	}
>
> 	if (pipe(p) < 0)
> 		die("pipe");
>
> 	for (i = 0; i < NUM_THREADS; i++) {
> 		id[i] = i;
> 		pthread_create(&threads[i], NULL, run_func, &id[i]);
> 	}
>
> 	for (j = 0; j < NUM_EVENTS; j++) {
> 		write(p[1], p, sizeof(int));
> 		usleep(100);
> 	}
>
> 	for (i = 0; i < NUM_THREADS; i++) {
> 		pthread_cancel(threads[i]);
> 		/* Wait for the thread to exit before reading its counter. */
> 		pthread_join(threads[i], NULL);
> 		printf("joined: %d\n", i);
> 		printf("event count: %d\n", event_count[i]);
> 		total += event_count[i];
> 		if (!event_count[i])
> 			nohit++;
> 	}
>
> 	printf("total events is: %d\n", total);
> 	printf("nohit is: %d\n", nohit);
> 	return 0;
> }
>
>
> Jason Baron (2):
>   sched/wait: add round robin wakeup mode
>   epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN
>
>  fs/eventpoll.c                 | 25 ++++++++++++++++++++-----
>  include/linux/wait.h           | 11 +++++++++++
>  include/uapi/linux/eventpoll.h |  6 ++++++
>  kernel/sched/wait.c            |  5 ++++-
>  4 files changed, 41 insertions(+), 6 deletions(-)
>
> --
> 1.8.2.rc2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/
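
For readers who want to try the registration step outside the threaded test
program, here is a minimal sketch of how an application might attach several
epoll fds to one shared fd with EPOLLEXCLUSIVE. It is only an illustration
under stated assumptions: the helper names add_exclusive() and watch_shared()
are hypothetical and not part of the series, the flag value (1 << 28) is the
one used by the test program quoted above, and the fallback assumes that a
kernel without the flag rejects it with EINVAL.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Assumption: same bit as in the quoted test program. */
#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)
#endif

/*
 * Hypothetical helper: register fd for EPOLLIN on epfd and ask for
 * exclusive wakeups. If the kernel rejects the request with EINVAL,
 * retry without the flag.
 * Returns 1 if registered exclusively, 0 if the fallback was used,
 * and -1 on any other error.
 */
static int add_exclusive(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
		return 1;
	if (errno != EINVAL)
		return -1;

	ev.events = EPOLLIN;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: create n epoll instances that all watch the
 * same fd, one per worker. Returns 0 on success, -1 on failure
 * (already-created epoll fds are not cleaned up in this sketch).
 */
static int watch_shared(int *epfds, int n, int fd)
{
	for (int i = 0; i < n; i++) {
		epfds[i] = epoll_create1(0);
		if (epfds[i] < 0 || add_exclusive(epfds[i], fd) < 0)
			return -1;
	}
	return 0;
}
```

Each worker would then block in epoll_wait() on its own epfds[i]. Note that
the flag is only passed with EPOLL_CTL_ADD here, matching the rough manpage
text above.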