On Thu, Mar 21, 2013 at 4:52 AM, Eric Wong <normalperson@xxxxxxxx> wrote:
> This is still not a proper commit; I've only lightly tested this.
>
> Replace the spinlock-protected linked list ready list with wfcqueue.
>
> This improves performance under heavy, multi-threaded workloads with
> multiple threads calling epoll_wait.
>
> Using my multi-threaded EPOLLONESHOT microbenchmark, performance is
> nearly doubled:
>
>   $ eponeshotmt -t 4 -w 4 -f 10 -c 1000000
>
>   Before:
>     real    0m 9.58s
>     user    0m 1.22s
>     sys     0m 37.08s
>
>   After:
>     real    0m 5.00s
>     user    0m 1.07s
>     sys     0m 18.92s
>
> ref: http://yhbt.net/eponeshotmt.c
>
> Unfortunately, there are still regressions for the common,
> single-threaded, Level Trigger use case.
>
> Things changed/removed:
>
> * ep->ovflist - no longer needed; the main ready list continues to be
>   appended to while we iterate through the transaction list.
>
> * ep_scan_ready_list - not enough generic code between users anymore
>   to warrant this.  ep_poll_readyevents_proc (used for poll) is
>   read-only, using __wfcq_for_each, while ep_send_events (used for
>   epoll_wait) dequeues and needs __wfcq_for_each_safe.
>
> * ep->lock renamed to ep->wqlock; this only protects waitqueues now
>   (we use trylock to further avoid redundant wakeups).
>
> * EPOLL_CTL_DEL/close() on a ready file will not immediately release
>   epitem memory; epoll_wait() must be called, since there is no way to
>   delete a ready item from wfcqueue in O(1) time.  In practice this
>   should not be a problem: any valid app using epoll must call
>   epoll_wait occasionally.  Unfreed epitems still count against
>   max_user_watches to protect against local DoS.  This should be the
>   only possibly-noticeable change (in case there is an app that blindly
>   adds/deletes things from the rbtree but never calls epoll_wait).
>
> Changes since v1:
> * fixed memory leak with pre-existing zombies in ep_free
> * updated to use the latest wfcqueue.h APIs
> * switched to using __wfcq_splice and a global transaction list
>   (this is like the old txlist in ep_scan_ready_list)
> * redundant wakeups avoided in ep_notify_waiters:
>   - only attempt a wakeup when an item is enqueued for the first time
>   - use spin_trylock_irqsave when attempting notification, since a
>     failed lock means either another task is already waking, or
>     ep_poll is already running and will check anyway upon releasing
>     wqlock
> * explicitly cache-aligned rdltail in SMP
> * added ep_item_state for atomically reading epi->state with a barrier
>   (avoids WARN_ON in ep_send_events)
> * reverted epi->nwait removal, it was not necessary;
>   sizeof(epitem) is still <= 128 bytes on 64-bit machines
>
> Changes since v2:
> * epi->state is no longer atomic; we only cmpxchg in ep_poll_callback
>   now and rely on implicit barriers in other places for reading.
> * intermediate EP_STATE_DEQUEUE removed; this (with xchg) caused too
>   much overhead in the ep_send_events loop and could not eliminate
>   starvation dangers from improper EPOLLET usage (the original code
>   had this problem too; the window is just a few cycles larger now).
> * minor code cleanups
>
> Lightly-tested-by: Eric Wong <normalperson@xxxxxxxx>
> Cc: Davide Libenzi <davidel@xxxxxxxxxxxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> ---
> ...
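For anyone following along who has not used wfcqueue before, the basic
pattern the patch relies on is: producers enqueue lock-free, and the
single consumer splices everything currently ready into a private
transaction list before draining it. Below is a small userspace sketch
of just that pattern, written against liburcu's cds_wfcq_* API (the
kernel patch uses a proposed wfcq_* port of the same structure); the
"struct item", the queue names and the build line are my own
illustration, not anything from the patch:

	/*
	 * Userspace sketch of the enqueue -> splice -> drain pattern.
	 * Build (library name may vary by liburcu version):
	 *   gcc -Wall wfcq_demo.c -o wfcq_demo -lurcu-common
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <urcu/wfcqueue.h>

	struct item {
		struct cds_wfcq_node node;	/* first member, so a plain cast works */
		int value;
	};

	static struct cds_wfcq_head rdlhead, txlhead;	/* "ready" and "transaction" lists */
	static struct cds_wfcq_tail rdltail, txltail;

	int main(void)
	{
		struct cds_wfcq_node *node;
		int i;

		cds_wfcq_init(&rdlhead, &rdltail);
		cds_wfcq_init(&txlhead, &txltail);

		/* producers (think ep_poll_callback) may do this concurrently,
		 * without taking any lock */
		for (i = 0; i < 4; i++) {
			struct item *it = malloc(sizeof(*it));

			cds_wfcq_node_init(&it->node);
			it->value = i;
			cds_wfcq_enqueue(&rdlhead, &rdltail, &it->node);
		}

		/*
		 * single consumer (think ep_send_events under ep->mtx):
		 * splice everything currently ready into the private
		 * transaction list, then drain it.  Producers may keep
		 * appending to rdlhead/rdltail the whole time.
		 */
		__cds_wfcq_splice_blocking(&txlhead, &txltail, &rdlhead, &rdltail);

		while ((node = __cds_wfcq_dequeue_blocking(&txlhead, &txltail))) {
			struct item *it = (struct item *)node;

			printf("dequeued %d\n", it->value);
			free(it);
		}
		return 0;
	}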
> @@ -967,8 +951,6 @@ static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
>   */
>  static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  {
> -	int pwake = 0;
> -	unsigned long flags;
>  	struct epitem *epi = ep_item_from_wait(wait);
>  	struct eventpoll *ep = epi->ep;
>
> @@ -983,7 +965,8 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  		list_del_init(&wait->task_list);
>  	}
>
> -	spin_lock_irqsave(&ep->lock, flags);
> +	/* pairs with smp_mb in ep_modify */
> +	smp_rmb();
>
>  	/*
>  	 * If the event mask does not contain any poll(2) event, we consider the
> @@ -992,7 +975,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	 * until the next EPOLL_CTL_MOD will be issued.
>  	 */
>  	if (!(epi->event.events & ~EP_PRIVATE_BITS))
> -		goto out_unlock;
> +		return 1;
>
>  	/*
>  	 * Check the events coming with the callback. At this stage, not
> @@ -1001,52 +984,14 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
>  	 * test for "key" != NULL before the event match test.
>  	 */
>  	if (key && !((unsigned long) key & epi->event.events))
> -		goto out_unlock;
> -
> -	/*
> -	 * If we are transferring events to userspace, we can hold no locks
> -	 * (because we're accessing user memory, and because of linux f_op->poll()
> -	 * semantics). All the events that happen during that period of time are
> -	 * chained in ep->ovflist and requeued later on.
> -	 */
> -	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
> -		if (epi->next == EP_UNACTIVE_PTR) {
> -			epi->next = ep->ovflist;
> -			ep->ovflist = epi;
> -			if (epi->ws) {
> -				/*
> -				 * Activate ep->ws since epi->ws may get
> -				 * deactivated at any time.
> -				 */
> -				__pm_stay_awake(ep->ws);
> -			}
> +		return 1;
>
> -		}
> -		goto out_unlock;
> -	}
> -
> -	/* If this file is already in the ready list we exit soon */
> -	if (!ep_is_linked(&epi->rdllink)) {
> -		list_add_tail(&epi->rdllink, &ep->rdllist);
> +	if (ep_mark_ready(epi)) {
> +		wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);
>  		ep_pm_stay_awake_rcu(epi);
> +		ep_notify_waiters(ep);
>  	}
>
> -	/*
> -	 * Wake up ( if active ) both the eventpoll wait list and the ->poll()
> -	 * wait list.
> -	 */
> -	if (waitqueue_active(&ep->wq))
> -		wake_up_locked(&ep->wq);
> -	if (waitqueue_active(&ep->poll_wait))
> -		pwake++;
> -
> -out_unlock:
> -	spin_unlock_irqrestore(&ep->lock, flags);
> -
> -	/* We have to call this outside the lock */
> -	if (pwake)
> -		ep_poll_safewake(&ep->poll_wait);
> -
>  	return 1;
>  }
> ...
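ep_mark_ready() and ep_notify_waiters() are among the parts elided
above. Going by the changelog, the idea is an idle -> ready cmpxchg
gating the enqueue, plus a wakeup that only trylocks wqlock and gives
up on contention. A rough userspace analogue of those two tricks, to
make the intent concrete; every name in it (mark_ready, notify_waiters,
struct demo_ep) is made up, this is not the patch code:

	/* Build: gcc -Wall state_demo.c -o state_demo -pthread */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	enum { STATE_IDLE, STATE_READY };

	struct demo_item {
		atomic_int state;
	};

	struct demo_ep {
		pthread_mutex_t wqlock;
		pthread_cond_t wq;
		atomic_int nready;	/* stands in for the wfcqueue */
	};

	/* analogous to ep_mark_ready(): succeed only on the idle -> ready
	 * transition, so concurrent callbacks enqueue an item only once */
	static bool mark_ready(struct demo_item *it)
	{
		int expected = STATE_IDLE;

		return atomic_compare_exchange_strong(&it->state, &expected,
						      STATE_READY);
	}

	/* analogous to ep_notify_waiters(): trylock and skip on contention;
	 * whoever holds wqlock is either already waking waiters or is the
	 * waiter itself, which rechecks for events before sleeping */
	static void notify_waiters(struct demo_ep *ep)
	{
		if (pthread_mutex_trylock(&ep->wqlock) == 0) {
			pthread_cond_signal(&ep->wq);
			pthread_mutex_unlock(&ep->wqlock);
		}
	}

	int main(void)
	{
		struct demo_ep ep = {
			.wqlock = PTHREAD_MUTEX_INITIALIZER,
			.wq = PTHREAD_COND_INITIALIZER,
		};
		struct demo_item it = { .state = STATE_IDLE };

		if (mark_ready(&it)) {
			atomic_fetch_add(&ep.nready, 1);
			notify_waiters(&ep);
		}
		/* a second "callback" on the same item sees the cmpxchg fail
		 * and does nothing: no duplicate enqueue, no duplicate wakeup */
		if (!mark_ready(&it))
			printf("second mark_ready skipped, item already ready\n");

		return 0;
	}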
> -static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
> -			       void *priv)
> +static int ep_send_events(struct eventpoll *ep,
> +			  struct epoll_event __user *uevent, int maxevents)
>  {
> -	struct ep_send_events_data *esed = priv;
> -	int eventcnt;
> +	int eventcnt = 0;
>  	unsigned int revents;
>  	struct epitem *epi;
> -	struct epoll_event __user *uevent;
>  	struct wakeup_source *ws;
> +	struct wfcq_node *node, *n;
> +	enum epoll_item_state state;
>  	poll_table pt;
>
>  	init_poll_funcptr(&pt, NULL);
>
>  	/*
> -	 * We can loop without lock because we are passed a task private list.
> -	 * Items cannot vanish during the loop because ep_scan_ready_list() is
> -	 * holding "mtx" during this call.
> +	 * Items cannot vanish during the loop because we are holding
> +	 * "mtx" during this call.
>  	 */
> -	for (eventcnt = 0, uevent = esed->events;
> -	     !list_empty(head) && eventcnt < esed->maxevents;) {
> -		epi = list_first_entry(head, struct epitem, rdllink);
> +	__wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
> +		epi = ep_item_from_node(node);
> +		__wfcq_dequeue(&ep->txlhead, &ep->txltail);
> +
> +		/* no barrier needed, splicing txl should be enough */
> +		state = epi->state;
> +
> +		if (state == EP_STATE_ZOMBIE) {
> +			ep_reap(ep, epi);
> +			continue;
> +		}
> +		WARN_ON(state != EP_STATE_READY);
> +		wfcq_node_init(&epi->rdllink);
>
>  		/*
>  		 * Activate ep->ws before deactivating epi->ws to prevent

Does anything deactivate ep->ws now?

> @@ -1459,7 +1394,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>  		 *
>  		 * This could be rearranged to delay the deactivation of epi->ws
>  		 * instead, but then epi->ws would temporarily be out of sync
> -		 * with ep_is_linked().
> +		 * with epi->state.
>  		 */
>  		ws = ep_wakeup_source(epi);
>  		if (ws) {
> @@ -1468,25 +1403,29 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>  			__pm_relax(ws);
>  		}
>
> -		list_del_init(&epi->rdllink);
> -
>  		revents = ep_item_poll(epi, &pt);
>
>  		/*
>  		 * If the event mask intersect the caller-requested one,
> -		 * deliver the event to userspace. Again, ep_scan_ready_list()
> +		 * deliver the event to userspace. Again, ep_poll()
>  		 * is holding "mtx", so no operations coming from userspace
>  		 * can change the item.
>  		 */
>  		if (revents) {
>  			if (__put_user(revents, &uevent->events) ||
>  			    __put_user(epi->event.data, &uevent->data)) {
> -				list_add(&epi->rdllink, head);
> +				wfcq_enqueue(&ep->txlhead, &ep->txltail,
> +					     &epi->rdllink);
>  				ep_pm_stay_awake(epi);
> -				return eventcnt ? eventcnt : -EFAULT;
> +				if (!eventcnt)
> +					eventcnt = -EFAULT;
> +				break;
>  			}
> -			eventcnt++;
> +
>  			uevent++;
> +			if (++eventcnt == maxevents)
> +				n = NULL;	/* stop iteration */
> +
>  			if (epi->event.events & EPOLLONESHOT)
>  				epi->event.events &= EP_PRIVATE_BITS;
>  			else if (!(epi->event.events & EPOLLET)) {
> @@ -1495,32 +1434,25 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
>  				 * Trigger mode, we need to insert back inside
>  				 * the ready list, so that the next call to
>  				 * epoll_wait() will check again the events
> -				 * availability. At this point, no one can insert
> -				 * into ep->rdllist besides us. The epoll_ctl()
> -				 * callers are locked out by
> -				 * ep_scan_ready_list() holding "mtx" and the
> -				 * poll callback will queue them in ep->ovflist.
> +				 * availability.
>  				 */
> -				list_add_tail(&epi->rdllink, &ep->rdllist);
> +				wfcq_enqueue(&ep->rdlhead, &ep->rdltail,
> +					     &epi->rdllink);
>  				ep_pm_stay_awake(epi);
> +				continue;
>  			}
>  		}
> +
> +		/*
> +		 * reset item state for EPOLLONESHOT and EPOLLET
> +		 * no barrier here, rely on ep->mtx release for write barrier
> +		 */

What happens if ep_poll_callback runs before you set epi->state below?
It used to queue on ep->ovflist and call __pm_stay_awake on ep->ws, but
now it does not appear to do anything.

> +		epi->state = EP_STATE_IDLE;
>  	}
>
>  	return eventcnt;
>  }
> ...

-- 
Arve Hjønnevåg