EPOLLET behavior and performance

Hi,

Hoping I have the right mailing list for this topic.

I have a patch that improves the per-thread-hop time of an
EPOLLET-armed eventfd from ~20us to ~15us on a Pixel phone. A summary
of the performance tests and results can be found here:

https://docs.google.com/spreadsheets/d/17SF9WcKYfkFhqE5fmXAWIXZVKJTvft34fyLxgDgJrZw/edit?usp=sharing

This is actually a side-benefit of an experiment to improve EPOLLET's
contract such that event edges don't "disappear". For example, if an
eventfd write is being used to notify thread A of an event, but thread
B reads the data in the eventfd, there's a race where thread A's
"re-poll" of the eventfd inside epoll_wait can lose to the read,
squashing the notification.
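
To make that race concrete, here's a minimal userspace sketch of the
setup (illustrative only, not part of the patch series; error handling
is trimmed). The main thread plays thread A and waits on an
EPOLLET-armed eventfd, while thread B drains the counter first. With
the existing re-poll behavior this typically reports zero events even
though an edge occurred:

/*
 * Minimal sketch: thread A waits for the EPOLLET edge, thread B
 * consumes the eventfd counter. Build with: gcc -pthread race.c
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int efd; /* eventfd used as both notification and data channel */

/* Thread B: consumes the data (the eventfd counter). */
static void *data_consumer(void *arg)
{
	uint64_t count;

	/*
	 * This read races with thread A's re-poll inside epoll_wait:
	 * once the counter is back to zero, the re-poll sees nothing
	 * readable and thread A never learns an edge happened.
	 */
	if (read(efd, &count, sizeof(count)) == sizeof(count))
		printf("thread B drained %llu\n", (unsigned long long)count);
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
	struct epoll_event out;
	uint64_t one = 1;
	pthread_t b;
	int epfd, n;

	efd = eventfd(0, EFD_NONBLOCK);
	epfd = epoll_create1(EPOLL_CLOEXEC);
	ev.data.fd = efd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

	/* Producer signals an event (the EPOLLIN edge). */
	write(efd, &one, sizeof(one));

	/* Thread B grabs the data before thread A enters epoll_wait. */
	pthread_create(&b, NULL, data_consumer, NULL);
	pthread_join(b, NULL);

	/* Thread A: the short timeout only keeps the demo from hanging. */
	n = epoll_wait(epfd, &out, 1, 100 /* ms */);
	printf("thread A saw %d event(s)\n", n);
	return 0;
}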

Please see the attached patch and its commit description for details
on how it works. The patch is currently based on v3.18 of the kernel.

A few questions:
1) Do you see any problems with the approach I'm taking?
2) How concerning is backwards compatibility, especially regarding
user code that may not handle being notified of EPOLLIN when the file
isn't actually readable anymore? (A defensive consumer that copes with
this is sketched after these questions.)
3) A ~30% improvement is larger than I would have expected. Any clues
what might be going on? I find it hard to believe improved cache
locality explains it all.
4) Are there existing performance tests I can run the patch against?
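
Regarding (2), here's the kind of consumer I have in mind that stays
correct either way (a rough sketch, not from the patches; the helper
name and sizes are made up): it treats EPOLLIN as a hint and tolerates
a read that finds the fd already drained:

#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Assumes the eventfds were registered with EPOLLIN | EPOLLET, created
 * with EFD_NONBLOCK, and that epoll_event.data.fd holds the fd.
 */
void handle_ready(int epfd)
{
	struct epoll_event events[32];
	int i, n;

	n = epoll_wait(epfd, events, 32, -1);
	for (i = 0; i < n; i++) {
		uint64_t count;
		ssize_t r;

		if (!(events[i].events & EPOLLIN))
			continue;

		r = read(events[i].data.fd, &count, sizeof(count));
		if (r < 0 && errno == EAGAIN) {
			/*
			 * Another thread already consumed the data; the
			 * notification itself is still meaningful.
			 */
			continue;
		}
		/* ... act on `count` signals ... */
	}
}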

Thanks!
Brian
From f00027009ea24b1262d10b64d39dc2d740bbc874 Mon Sep 17 00:00:00 2001
From: Brian Anderson <brianderson@xxxxxxxxxxxx>
Date: Thu, 6 Jul 2017 15:06:55 -0700
Subject: [PATCH 1/3] epoll: Ensure all epoll files are notified of EPOLLET
 events.

Problems with old behavior:
================================
Before this patch, it is possible for some epoll files to miss an
EPOLLET edge. For example, if thread A reads from an fd while thread B
is between calls to epoll_wait, thread B might never be notified of the
fd's EPOLLIN. This is because the old behavior re-polls the fd from
epoll_wait, which is racy.

Benefits of new behavior:
================================
A file may now be used both as an event source and as a data channel
without requiring that the thread consuming the event and the thread
consuming the data be the same thread.

Performance tradeoffs:
================================
The old behavior results in fewer kernel space <-> user space
transitions when the user doesn't care to be notified if an event is no
longer valid. However, the new behavior allows an implementation to
cache events and avoid re-polling fds, which will help cases where
polling is expensive or results in lock contention.

Compatibility notes:
================================
This may break backward compatibility with user logic that expects
epoll_wait event flags to be valid. e.g. EPOLLIN implies a read on the
fd will return data and not block. Note: If there are multiple threads
epolling on the fd, it is already possible for one thread to lose the
race and have its first read return nothing. So the new behavior can
only break use cases where a single thread is epolling on and reading
from an fd.

This may break backward compatibility with user logic that expects the
absence of epoll_wait event flags to imply that the event is no longer
valid. Since the re-poll is removed, flags that were valid at the
previous wakeup and are still valid won't be set again unless there was
a new edge-triggering event, such as a read for EPOLLOUT.

TODO: If these are deal breakers, is it worth adding an EPOLLET2 flag?

This is forward compatible with the EPOLLEXCLUSIVE flag by ensuring
that exactly one epoll file is notified of an event, never zero.

Implementation notes:
================================
This patch adds unsent_events to epitem, which accumulates events as they
are received for an epoll file. unsent_events is only reset after it has
been copied to userspace for one of the threads waiting on the epoll
file.

Some devices do not provide an event mask when they notify epoll of a
new event via ep_poll_callback. In these cases, we fall back to the old
racy behavior. Falling back correctly is especially tricky to coordinate
with ep_insert since ep_insert doesn't know up-front how the device will
behave. The eventpoll->ovflist logic is re-used to ensure correctness.

Special care is taken to avoid increasing the size of epitem such that
it needs an additional cache line. In particular, tracking unsent_events
while the eventpoll->ovflist is active is done in eppoll_entry rather
than epitem.
---
 fs/eventpoll.c | 228 ++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 167 insertions(+), 61 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 08a6a6ee5ecb..7b887ed6b413 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -154,8 +154,18 @@ struct epitem {
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
-	/* Number of active wait queue attached to poll operations */
-	int nwait;
+	union {
+		/*
+		 * Number of active wait queue attached to poll operations
+		 * Only valid at the start of ep_insert.
+		 */
+		int nwait;
+		/*
+		 * Mask of events that haven't yet been sent to the user.
+		 * Only valid after the start of ep_insert.
+		 */
+		__u32 unsent_events;
+	};
 
 	/* List containing poll wait queues */
 	struct list_head pwqlist;
@@ -238,6 +248,14 @@ struct eppoll_entry {
 
 	/* The wait queue head that linked the "wait" wait queue item */
 	wait_queue_head_t *whead;
+
+	/*
+	 * Backup for epitem->unsent_events while the ovflist is active.
+	 * Accumulated across all eppoll_entrys once ovflist is deactivated.
+	 * Added per eppoll_entry rather than per epitem to prevent epitem from
+	 * using an additional cache line.
+	 */
+	__u32 ovf_unsent_events;
 };
 
 /* Wrapper struct used by poll queueing */
@@ -577,6 +595,42 @@ static inline void ep_pm_stay_awake_rcu(struct epitem *epi)
 	rcu_read_unlock();
 }
 
+void enter_callback_redirect_locked(struct eventpoll *ep) {
+	ep->ovflist = NULL;
+}
+
+/* Returns true if epi_match was found in the ep->ovflist. */
+bool exit_callback_redirect_locked(struct eventpoll *ep, struct epitem *epi_match) {
+	struct epitem *epi, *nepi;
+	struct eppoll_entry *pwq;
+	bool match_found = false;
+
+	for (nepi = ep->ovflist; (epi = nepi) != NULL;
+		 nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
+
+		if (epi_match == epi)
+			match_found = true;
+
+		list_for_each_entry(pwq, &epi->pwqlist, llink) {
+			epi->unsent_events |= pwq->ovf_unsent_events;
+			pwq->ovf_unsent_events = 0;
+		}
+
+		if (!ep_is_linked(&epi->rdllink)) {
+			list_add_tail(&epi->rdllink, &ep->rdllist);
+			ep_pm_stay_awake(epi);
+		}
+	}
+	/*
+	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
+	 * releasing the lock, events will be queued in the normal way inside
+	 * ep->rdllist.
+	 */
+	ep->ovflist = EP_UNACTIVE_PTR;
+
+	return match_found;
+}
+
 /**
  * ep_scan_ready_list - Scans the ready list in a way that makes possible for
  *                      the scan code, to call f_op->poll(). Also allows for
@@ -597,7 +651,6 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 {
 	int error, pwake = 0;
 	unsigned long flags;
-	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
 
 	/*
@@ -618,7 +671,7 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	 */
 	spin_lock_irqsave(&ep->lock, flags);
 	list_splice_init(&ep->rdllist, &txlist);
-	ep->ovflist = NULL;
+	enter_callback_redirect_locked(ep);
 	spin_unlock_irqrestore(&ep->lock, flags);
 
 	/*
@@ -627,30 +680,13 @@ static int ep_scan_ready_list(struct eventpoll *ep,
 	error = (*sproc)(ep, &txlist, priv);
 
 	spin_lock_irqsave(&ep->lock, flags);
+
 	/*
 	 * During the time we spent inside the "sproc" callback, some
 	 * other events might have been queued by the poll callback.
 	 * We re-insert them inside the main ready-list here.
 	 */
-	for (nepi = ep->ovflist; (epi = nepi) != NULL;
-	     nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
-		/*
-		 * We need to check if the item is already in the list.
-		 * During the "sproc" callback execution time, items are
-		 * queued into ->ovflist but the "txlist" might already
-		 * contain them, and the list_splice() below takes care of them.
-		 */
-		if (!ep_is_linked(&epi->rdllink)) {
-			list_add_tail(&epi->rdllink, &ep->rdllist);
-			ep_pm_stay_awake(epi);
-		}
-	}
-	/*
-	 * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
-	 * releasing the lock, events will be queued in the normal way inside
-	 * ep->rdllist.
-	 */
-	ep->ovflist = EP_UNACTIVE_PTR;
+	exit_callback_redirect_locked(ep, NULL);
 
 	/*
 	 * Quickly re-inject items left on "txlist".
@@ -1006,6 +1042,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	unsigned long flags;
 	struct epitem *epi = ep_item_from_wait(wait);
 	struct eventpoll *ep = epi->ep;
+	__u32 events;
 
 	if ((unsigned long)key & POLLFREE) {
 		ep_pwq_from_wait(wait)->whead = NULL;
@@ -1035,7 +1072,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 	 * callback. We need to be able to handle both cases here, hence the
 	 * test for "key" != NULL before the event match test.
 	 */
-	if (key && !((unsigned long) key & epi->event.events))
+	events = ((unsigned long long)key) & epi->event.events;
+	if (key && !events)
 		goto out_unlock;
 
 	/*
@@ -1057,9 +1095,20 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
 			}
 
 		}
+		/* If key is 0, ovf_unsent_events and events will always be 0 here. */
+		ep_pwq_from_wait(wait)->ovf_unsent_events |= events;
 		goto out_unlock;
 	}
 
+	/*
+	 * If key is 0, make sure to poll before sending events to the user.
+	 * unsent_events may not be zero here when key is 0 because of ep_insert.
+	 */
+	if (key)
+		epi->unsent_events |= events;
+	else
+		epi->unsent_events = 0;
+
 	/* If this file is already in the ready list we exit soon */
 	if (!ep_is_linked(&epi->rdllink)) {
 		list_add_tail(&epi->rdllink, &ep->rdllist);
@@ -1100,6 +1149,7 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
 		pwq->whead = whead;
 		pwq->base = epi;
 		add_wait_queue(whead, &pwq->wait);
+		pwq->ovf_unsent_events = 0;
 		list_add_tail(&pwq->llink, &epi->pwqlist);
 		epi->nwait++;
 	} else {
@@ -1268,13 +1318,14 @@ static noinline void ep_destroy_wakeup_source(struct epitem *epi)
  * Must be called with "mtx" held.
  */
 static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
-		     struct file *tfile, int fd, int full_check)
+             struct file *tfile, int fd, int full_check)
 {
 	int error, revents, pwake = 0;
 	unsigned long flags;
 	long user_watches;
 	struct epitem *epi;
 	struct ep_pqueue epq;
+	bool racy_callback;
 
 	user_watches = atomic_long_read(&ep->user->epoll_watches);
 	if (unlikely(user_watches >= max_user_watches))
@@ -1299,10 +1350,47 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 		RCU_INIT_POINTER(epi->ws, NULL);
 	}
 
+	/*
+	 * nwait is invalid after this point.
+	 * unsent_events becomes valid, so initialize it to zero here.
+	 * Note: unsent_events isn't accessed during callback_redirect sections.
+	 */
+	epi->unsent_events = 0;
+
+	/* Add the current item to the list of active epoll hook for this file */
+	spin_lock(&tfile->f_lock);
+	list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
+	spin_unlock(&tfile->f_lock);
+
+	/*
+	 * Add the current item to the RB tree. All RB tree operations are
+	 * protected by "mtx", and ep_insert() is called with "mtx" held.
+	 */
+	ep_rbtree_insert(ep, epi);
+
+	/* now check if we've created too many backpaths */
+	error = -EINVAL;
+	if (full_check && reverse_path_check())
+		goto error_remove_epi;
+
 	/* Initialize the poll table using the queue callback */
 	epq.epi = epi;
 	init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
 
+	/* Redirect callbacks so we can avoid races with callbacks that
+	 * happen while ep->lock isn't locked.
+	 * We cannot hold ep->lock while calling ep_item_poll since the device
+	 * may acquire its wait queue lock, which is the incorrect order.
+	 * If a callback with key==0 is received, we must poll for the real events.
+	 * We can't just OR it with revents below since that could cause us to
+	 * drop events that should have been sent to the user.
+	 * If all devices never set key to 0, redirecting the callbacks here
+	 * shouldn't be needed.
+	 */
+	spin_lock_irqsave(&ep->lock, flags);
+	enter_callback_redirect_locked(ep);
+	spin_unlock_irqrestore(&ep->lock, flags);
+
 	/*
 	 * Attach the item to the poll hooks and get current event bits.
 	 * We can safely use the file* here because its usage count has
@@ -1321,27 +1409,28 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	if (epi->nwait < 0)
 		goto error_unregister;
 
-	/* Add the current item to the list of active epoll hook for this file */
-	spin_lock(&tfile->f_lock);
-	list_add_tail_rcu(&epi->fllink, &tfile->f_ep_links);
-	spin_unlock(&tfile->f_lock);
+	spin_lock_irqsave(&ep->lock, flags);
 
 	/*
-	 * Add the current item to the RB tree. All RB tree operations are
-	 * protected by "mtx", and ep_insert() is called with "mtx" held.
+	 * Collect callbacks that may have occurred between ep_item_poll
+	 * and ep->lock acquisition.
 	 */
-	ep_rbtree_insert(ep, epi);
-
-	/* now check if we've created too many backpaths */
-	error = -EINVAL;
-	if (full_check && reverse_path_check())
-		goto error_remove_epi;
-
-	/* We have to drop the new item inside our item list to keep track of it */
-	spin_lock_irqsave(&ep->lock, flags);
-
-	/* If the file is already "ready" we drop it inside the ready list */
-	if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
+	racy_callback = exit_callback_redirect_locked(ep, epi);
+	if (racy_callback) {
+		/*
+		 * Don't add revents if the |key| received in the ep_poll_callback was
+		 * not used. This ensures we poll for the real events before sending
+		 * them to the user.
+		 */
+		if (epi->unsent_events)
+			epi->unsent_events |= revents;
+		/*
+		 * Don't worry about waking anything since it's already been handled
+		 * by exit_callback_redirect_locked.
+		 */
+	} else if (revents) {
+		/* If the file is already "ready" we drop it inside the ready list */
+		epi->unsent_events = revents;
 		list_add_tail(&epi->rdllink, &ep->rdllist);
 		ep_pm_stay_awake(epi);
 
@@ -1362,28 +1451,20 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 
 	return 0;
 
-error_remove_epi:
-	spin_lock(&tfile->f_lock);
-	list_del_rcu(&epi->fllink);
-	spin_unlock(&tfile->f_lock);
-
-	rb_erase(&epi->rbn, &ep->rbr);
-
 error_unregister:
 	ep_unregister_pollwait(ep, epi);
+	wakeup_source_unregister(ep_wakeup_source(epi));
 
-	/*
-	 * We need to do this because an event could have been arrived on some
-	 * allocated wait queue. Note that we don't care about the ep->ovflist
-	 * list, since that is used/cleaned only inside a section bound by "mtx".
-	 * And ep_insert() is called with "mtx" held.
-	 */
 	spin_lock_irqsave(&ep->lock, flags);
-	if (ep_is_linked(&epi->rdllink))
-		list_del_init(&epi->rdllink);
+	exit_callback_redirect_locked(ep, NULL);
 	spin_unlock_irqrestore(&ep->lock, flags);
 
-	wakeup_source_unregister(ep_wakeup_source(epi));
+error_remove_epi:
+	spin_lock(&tfile->f_lock);
+	list_del_rcu(&epi->fllink);
+	spin_unlock(&tfile->f_lock);
+
+	rb_erase(&epi->rbn, &ep->rbr);
 
 error_create_wakeup_source:
 	kmem_cache_free(epi_cache, epi);
@@ -1400,6 +1481,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	int pwake = 0;
 	unsigned int revents;
 	poll_table pt;
+	__u32 new_events = ~epi->event.events & event->events;
 
 	init_poll_funcptr(&pt, NULL);
 
@@ -1443,12 +1525,22 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	 */
 	revents = ep_item_poll(epi, &pt);
 
+	/*
+	 * For edge-triggered monitoring, don't notify for events that were
+	 * monitored before this call. If there was an edge, it should already
+	 * be reflected in epi->unsent_events.
+	 */
+	if (epi->event.events & EPOLLET) {
+		revents &= new_events;
+	}
+
 	/*
 	 * If the item is "hot" and it is not registered inside the ready
 	 * list, push it inside.
 	 */
-	if (revents & event->events) {
+	if (revents) {
 		spin_lock_irq(&ep->lock);
+		epi->unsent_events |= revents;
 		if (!ep_is_linked(&epi->rdllink)) {
 			list_add_tail(&epi->rdllink, &ep->rdllist);
 			ep_pm_stay_awake(epi);
@@ -1509,7 +1601,21 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 
 		list_del_init(&epi->rdllink);
 
-		revents = ep_item_poll(epi, &pt);
+		/*
+		 * If level triggered: Re-poll to avoid unnecessary notifications.
+		 *
+		 * If edge triggered: Avoid re-poll if events are cached in
+		 * unsent_events. This way, all epoll files are notified regardless of
+		 * racy state changes in the poll state; for example, if thread A reads
+		 * from an fd with unsent events while thread B is between calls
+		 * to epoll wait, thread B could miss the edge.
+		 * to epoll_wait, thread B could miss the edge.
+		if ((epi->event.events & EPOLLET) && epi->unsent_events) {
+			revents = epi->unsent_events & epi->event.events;
+		} else {
+			revents = ep_item_poll(epi, &pt);
+		}
+		epi->unsent_events = 0;
 
 		/*
 		 * If the event mask intersect the caller-requested one,
-- 
2.13.2.932.g7449e964c-goog

From 242ef9f428789689ba3f53c26ef7be218066f0d2 Mon Sep 17 00:00:00 2001
From: Brian Anderson <brianderson@xxxxxxxxxxxx>
Date: Wed, 12 Jul 2017 16:05:50 -0700
Subject: [PATCH 2/3] Lock free eventfd_poll.

---
 fs/eventfd.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index d6a88e7812f3..0d2e9de5c69d 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -118,18 +118,18 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 	unsigned int events = 0;
-	unsigned long flags;
+	__u64 count;
 
 	poll_wait(file, &ctx->wqh, wait);
 
-	spin_lock_irqsave(&ctx->wqh.lock, flags);
-	if (ctx->count > 0)
+	smp_mb();
+	count = ctx->count;
+	if (count > 0)
 		events |= POLLIN;
-	if (ctx->count == ULLONG_MAX)
+	if (count == ULLONG_MAX)
 		events |= POLLERR;
-	if (ULLONG_MAX - 1 > ctx->count)
+	if (ULLONG_MAX - 1 > count)
 		events |= POLLOUT;
-	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
 
 	return events;
 }
-- 
2.13.2.932.g7449e964c-goog

From 936d1dcdd4f0ed3fcd6a55dea196fcb3d0a0a96e Mon Sep 17 00:00:00 2001
From: Brian Anderson <brianderson@xxxxxxxxxxxx>
Date: Wed, 12 Jul 2017 16:50:04 -0700
Subject: [PATCH 3/3] EPOLLET Performance Tests

---
 epollet_test_1M_eventfd_edges.cpp | 105 +++++++++++++++++++++++++++++++++
 epollet_test_cycle.cpp            | 119 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 epollet_test_1M_eventfd_edges.cpp
 create mode 100644 epollet_test_cycle.cpp

diff --git a/epollet_test_1M_eventfd_edges.cpp b/epollet_test_1M_eventfd_edges.cpp
new file mode 100644
index 000000000000..3f1bbee640fa
--- /dev/null
+++ b/epollet_test_1M_eventfd_edges.cpp
@@ -0,0 +1,105 @@
+// Sets up a single epolling thread that listens to N threads notifying
+// an eventfd in a tight loop.
+// Reports how long it takes for the epolling thread to detect 1M event
+// edges as well as how many eventfd writes each thread could make.
+
+// Copied from a test added to SurfaceFlinger.cpp.
+// Apologies for the Android specific code.
+
+class EpollEtEventFdWriterThread : public Thread {
+public:
+    EpollEtEventFdWriterThread(int epollFd, std::atomic<bool>* stop)
+        : mEpollFd(epollFd), mStop(stop) {}
+
+    bool threadLoop() override {
+        ALOGE("EPOLLET writer start.");
+        EventFd eventfd(0);
+
+        epoll_event epollEvent;
+        epollEvent.events = EPOLLIN | EPOLLET;
+        epollEvent.data.ptr = static_cast<void*>(&eventfd);
+        errno = 0;
+        int ret = epoll_ctl(mEpollFd, EPOLL_CTL_ADD, eventfd.fd(), &epollEvent);
+        ALOGE_IF(ret == -1, "epoll_ctl (add) failed with %d", errno);
+
+        while(!mStop->load()) {
+            eventfd.write(1);
+            mWriteCount++;
+        }
+
+        ALOGE("EPOLLET writer end.");
+        return false;
+    }
+
+    int mEpollFd;
+    std::atomic<bool>* mStop;
+    uint64_t mWriteCount = 0;
+};
+
+static void epolletest_run(int iteration, int writerCount) {
+    ALOGE("EPOLLET start.");
+    int epollFd = epoll_create1(EPOLL_CLOEXEC);
+    std::atomic<bool> stop(false);
+
+    std::vector<sp<EpollEtEventFdWriterThread>> writers(writerCount, nullptr);
+    for (auto &w : writers) {
+        w = new EpollEtEventFdWriterThread(epollFd, &stop);
+        w->run("EventFdWriter");
+    }
+
+    constexpr int maxEventCount = 32;
+    epoll_event events[maxEventCount];
+    std::vector<uint64_t> readyCountHistogram(writerCount+1, 0);
+    int readyCount = 0;
+    uint64_t notifications = 0;
+
+    const nsecs_t startTime = systemTime();
+    while (notifications < 1000000) {
+        do {
+            errno = 0;
+            readyCount = epoll_wait(epollFd, events, maxEventCount, -1);
+        } while (errno == EINTR);
+
+        readyCountHistogram[readyCount]++;
+
+        for (int i = 0; i < readyCount; i++) {
+            if ((events[i].events & EPOLLIN) != 0) {
+                notifications++;
+            } else {
+                ALOGE("epoll_wait: unexpected mask: %d", events[i].events);
+            }
+        }
+    }
+    const nsecs_t endTime = systemTime();
+    ALOGE("EPOLLET end.");
+
+    stop = true;
+    for (auto &w : writers) {
+        ALOGE("EPOLLET join.");
+        w->join();
+    }
+    ALOGE("EPOLLET writers done.");
+    close(epollFd);
+
+    std::ostringstream result;
+    const int64_t runTime = endTime - startTime;
+    result << iteration << " " << writerCount << " " << runTime;
+    result << " wc:";
+    for (auto &w : writers) {
+        result << " " << w->mWriteCount;
+    }
+    result << " rch:";
+    for (auto &bucketSize : readyCountHistogram) {
+        result << " " << bucketSize;
+    }
+    ALOGE("EPOLLET result: %s", result.str().c_str());
+}
+
+void epolletest() {
+    std::vector<int> writerCounts = {1,2,4,8,16};
+    for (int i = 0; i < 201; i++) {
+        for (auto wc : writerCounts) {
+            epolletest_run(i, wc);
+        }
+    }
+}
diff --git a/epollet_test_cycle.cpp b/epollet_test_cycle.cpp
new file mode 100644
index 000000000000..201f01fa21a7
--- /dev/null
+++ b/epollet_test_cycle.cpp
@@ -0,0 +1,119 @@
+// Sets up N epolling threads connected in a cycle via eventfds.
+// Each thread notifies its next neighbor after being notified
+// by its previous neighbor.
+// Records how many laps can complete in 2 seconds.
+
+// Copied from a test added to SurfaceFlinger.cpp.
+// Apologies for the Android specific code.
+
+class EpollEtThread : public Thread {
+public:
+    EpollEtThread(std::atomic<bool>* stop) : mEventFd(0), mStop(stop) {
+        mEpollFd = epoll_create1(EPOLL_CLOEXEC);
+
+        epoll_event epollEvent;
+        epollEvent.events = EPOLLIN | EPOLLET;
+        epollEvent.data.ptr = static_cast<void*>(&mEventFd);
+        errno = 0;
+        int ret = epoll_ctl(mEpollFd, EPOLL_CTL_ADD, mEventFd.fd(), &epollEvent);
+        ALOGE_IF(ret == -1, "epoll_ctl (add) failed with %d", errno);
+    }
+
+    ~EpollEtThread() {
+        close(mEpollFd);
+    }
+
+    void setTarget(EpollEtThread* target) {
+        mTarget = target;
+    }
+
+    bool threadLoop() override {
+        ALOGE("EPOLLET writer start.");
+
+        constexpr int maxEventCount = 32;
+        epoll_event events[maxEventCount];
+        int readyCount = 0;
+        uint64_t notifications = 0;
+        while (!mStop->load()) {
+            do {
+                errno = 0;
+                readyCount = epoll_wait(mEpollFd, events, maxEventCount, -1);
+            } while (errno == EINTR);
+
+            for (int i = 0; i < readyCount; i++) {
+                if ((events[i].events & EPOLLIN) != 0) {
+                    notifications++;
+                } else {
+                    ALOGE("epoll_wait: unexpected mask: %d", events[i].events);
+                }
+            }
+
+            // Notify next thread in cycle.
+            mTarget->mEventFd.write(1);
+        }
+
+        // Make sure the next thread in the cycle wakes up so it can observe mStop and exit.
+        mTarget->mEventFd.write(1);
+
+        ALOGE("EPOLLET writer end.");
+        return false;
+    }
+
+    int mEpollFd;
+    EventFd mEventFd;
+    EpollEtThread* mTarget = nullptr;
+    std::atomic<bool>* mStop;
+};
+
+static void epolletest_circle_run(int iteration, int writerCount) {
+    ALOGE("EPOLLET start.");
+
+    std::atomic<bool> stop(false);
+
+    std::vector<sp<EpollEtThread>> writers(writerCount, nullptr);
+    for (auto &w : writers) {
+        w = new EpollEtThread(&stop);
+    }
+    for (size_t i = 1; i < writers.size(); i++) {
+        writers[i-1]->setTarget(writers[i].get());
+    }
+    writers[writers.size()-1]->setTarget(writers[0].get());
+    for (auto &w : writers) {
+        w->run("EventFdWriter");
+    }
+
+    int dummyEpoll = epoll_create1(EPOLL_CLOEXEC);
+
+    // Kick off the circle of dominoes.
+    const nsecs_t startTime = systemTime();
+    writers[0]->mEventFd.write(1);
+
+    constexpr int maxEventCount = 32;
+    epoll_event events[maxEventCount];
+    epoll_wait(dummyEpoll, events, maxEventCount, 2000);
+    stop = true;
+    for (auto &w : writers) {
+        ALOGE("EPOLLET join.");
+        w->join();
+    }
+    const nsecs_t endTime = systemTime();
+    ALOGE("EPOLLET writers done.");
+
+    close(dummyEpoll);
+
+    uint64_t lapCount = 0;
+    writers[0]->mEventFd.read(&lapCount);
+    std::ostringstream result;
+    const int64_t runTime = endTime - startTime;
+    result << iteration << " " << writerCount << " " << runTime << " " << lapCount;
+    ALOGE("EPOLLET result: %s", result.str().c_str());
+}
+
+void epolletest_circle() {
+    std::vector<int> writerCounts = {1,2,4,8,16};
+    for (int i = 0; i < 201; i++) {
+        for (auto wc : writerCounts) {
+            epolletest_circle_run(i, wc);
+        }
+    }
+}
-- 
2.13.2.932.g7449e964c-goog

