Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2018-12-04 18:23, Jason Baron wrote:
On 12/3/18 6:02 AM, Roman Penyaev wrote:

[...]


 	ep_set_busy_poll_napi_id(epi);

@@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 	 */
 	if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
 		if (epi->next == EP_UNACTIVE_PTR) {
-			epi->next = ep->ovflist;
-			ep->ovflist = epi;
+			/* Atomically exchange tail */
+			epi->next = xchg(&ep->ovflist, epi);

This also relies on the fact that the same epi can't be added to the
list in parallel, because the wait queue doing the wakeup will have the
wait_queue_head lock. That was a little confusing for me b/c we only had
the read_lock() above.

Yes, that is indeed not obvious path, but wq is locked by wake_up_*_poll()
call or caller of wake_up_locked_poll() has to be sure wq.lock is taken.

I'll add an explicit comment here, thanks for pointing out.


 			if (epi->ws) {
 				/*
 				 * Activate ep->ws since epi->ws may get
@@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v

 	/* If this file is already in the ready list we exit soon */
 	if (!ep_is_linked(epi)) {
-		list_add_tail(&epi->rdllink, &ep->rdllist);
+		list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
 		ep_pm_stay_awake_rcu(epi);
 	}

same for this.

... and an explicit comment here.



@@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 				break;
 			}
 		}
-		wake_up_locked(&ep->wq);
+		wake_up(&ep->wq);

why the switch here to the locked() variant? Shouldn't the 'reader'
side, in this case which takes the rwlock for write see all updates in a
coherent state at this point?

lockdep inside __wake_up_common expects wq_head->lock is taken, and
seems this is not a good idea to leave wq naked on wake up path,
when several CPUs can enter wake function. Although default_wake_function
is protected by spinlock inside try_to_wake_up(), but for example
autoremove_wake_function() can't be called concurrently for the same wq
(it removes wq entry from the list).  Also in case of bookmarks
__wake_up_common adds an entry to the list, thus can't be called without
any locks.

I understand you concern and you are right saying that read side sees
wq entries as stable, but that will work only if __wake_up_common does
not modify anything, that is seems true now, but of course it is
too scary to rely on that in the future.


 	}
 	if (waitqueue_active(&ep->poll_wait))
 		pwake++;

 out_unlock:
-	spin_unlock_irqrestore(&ep->wq.lock, flags);
+	read_unlock_irqrestore(&ep->lock, flags);

 	/* We have to call this outside the lock */
 	if (pwake)
@@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 		goto error_remove_epi;

/* We have to drop the new item inside our item list to keep track of it */
-	spin_lock_irq(&ep->wq.lock);
+	write_lock_irq(&ep->lock);

 	/* record NAPI ID of new item if present */
 	ep_set_busy_poll_napi_id(epi);
@@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,

 		/* Notify waiting tasks that events are available */
 		if (waitqueue_active(&ep->wq))
-			wake_up_locked(&ep->wq);
+			wake_up(&ep->wq);

is this necessary to switch as well? Is this to make lockdep happy?
Looks like there are few more conversions too below...

Yes, necessary, wq.lock should be taken.

--
Roman




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux