On 2019-10-07 20:43, Jason Baron wrote:
On 10/7/19 2:30 PM, Roman Penyaev wrote:
On 2019-10-07 18:42, Jason Baron wrote:
On 10/7/19 6:54 AM, Roman Penyaev wrote:
On 2019-10-03 18:13, Jason Baron wrote:
On 9/30/19 7:55 AM, Roman Penyaev wrote:
On 2019-09-28 04:29, Andrew Morton wrote:
On Wed, 25 Sep 2019 09:56:03 +0800 hev <r@xxxxxx> wrote:
From: Heiher <r@xxxxxx>
Take the case where we have:
t0
| (ew)
e0
| (et)
e1
| (lt)
s0
t0: thread 0
e0: epoll fd 0
e1: epoll fd 1
s0: socket fd 0
ew: epoll_wait
et: edge-trigger
lt: level-trigger
We only need to wakeup nested epoll fds if something has been
queued
to the
overflow list, since the ep_poll() traverses the rdllist during
recursive poll
and thus events on the overflow list may not be visible yet.
Test code:
Look sane to me. Do you have any performance testing results
which
show a benefit?
epoll maintainership isn't exactly a hive of activity nowadays :(
Roman, would you please have time to review this?
So here is my observation: current patch does not fix the
described
problem (double wakeup) for the case, when new event comes exactly
to the ->ovflist. According to the patch this is the desired
intention:
/*
* We only need to wakeup nested epoll fds if something has
been
queued
* to the overflow list, since the ep_poll() traverses the
rdllist
* during recursive poll and thus events on the overflow list
may
not be
* visible yet.
*/
if (nepi != NULL)
pwake++;
....
if (pwake == 2)
ep_poll_safewake(&ep->poll_wait);
but this actually means that we repeat the same behavior (double
wakeup)
but only for the case, when event comes to the ->ovflist.
How to reproduce? Can be easily done (ok, not so easy but it is
possible
to try): to the given userspace test we need to add one more
socket
and
immediately fire the event:
e.events = EPOLLIN;
if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
goto out;
/*
* Signal any fd to let epoll_wait() to call
ep_scan_ready_list()
* in order to "catch" it there and add new event to
->ovflist.
*/
if (write(s2fd[1], "w", 1) != 1)
goto out;
That is done in order to let the following epoll_wait() call to
invoke
ep_scan_ready_list(), where we can "catch" and insert new event
exactly
to the ->ovflist. In order to insert event exactly in the correct
list
I introduce artificial delay.
Modified test and kernel patch is below. Here is the output of
the
testing tool with some debug lines from kernel:
# ~/devel/test/edge-bug
[ 59.263178] ### sleep 2
>> write to sock
[ 61.318243] ### done sleep
[ 61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
events_in_rdllist=1, events_in_ovflist=1
[ 61.321204] ### sleep 2
[ 63.398325] ### done sleep
error: What?! Again?!
First epoll_wait() call (ep_scan_ready_list()) observes 2 events
(see "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
wanted to achieve, so eventually ep_poll_safewake() is called
again
which leads to double wakeup.
In my opinion current patch as it is should be dropped, it does
not
fix the described problem but just hides it.
--
Hi Jason,
Yes, there are 2 wakeups in the test case you describe, but if the
second event (write to s1fd) gets queued after the first call to
epoll_wait(), we are going to get 2 wakeups anyways.
Yes, exactly, for this reason I print out the number of events
observed
on first wait, there should be 1 (rdllist) and 1 (ovflist),
otherwise
this is another case, when second event comes exactly after first
wait, which is legitimate wakeup.
So yes, there may
be a slightly bigger window with this patch for 2 wakeups, but its
small
and I tried to be conservative with the patch - I'd rather get an
occasional 2nd wakeup then miss one. Trying to debug missing
wakeups
isn't always fun...
That said, the reason for propagating events that end up on the
overflow
list was to prevent the race of the wakee not seeing events because
they
were still on the overflow list. In the testcase, imagine if there
was a
thread doing epoll_wait() on efd[0], and then a write happends on
s1fd.
I thought it was possible then that a 2nd thread doing epoll_wait()
on
efd[1], wakes up, checks efd[0] and sees no events because they are
still potentially on the overflow list. However, I think that case
is
not possible because the thread doing epoll_wait() on efd[0] is
going to
have the ep->mtx, and thus when the thread wakes up on efd[1], its
going
to have to be ordered because its also grabbing the ep->mtx
associated
with efd[0].
So I think its safe to do the following if we want to go further
than
the proposed patch, which is what you suggested earlier in the
thread
(minus keeping the wakeup on ep->wq).
Then I do not understand why we need to keep ep->wq wakeup?
@wq and @poll_wait are almost the same with only one difference:
@wq is used when you sleep on it inside epoll_wait() and the other
is used when you nest epoll fd inside epoll fd. Either you wake
both up either you don't this at all.
ep_poll_callback() does wakeup explicitly, ep_insert() and
ep_modify()
do wakeup explicitly, so what are the cases when we need to do
wakeups
from ep_scan_ready_list()?
Hi Roman,
So the reason I was saying not to drop the ep->wq wakeup was that I
was
thinking about a usecase where you have multi-threads say thread A
and
thread B which are doing epoll_wait() on the same epfd. Now, the
threads
both call epoll_wait() and are added as exclusive to ep->wq. Now a
bunch
of events happen and thread A is woken up. However, thread A may only
process a subset of the events due to its 'maxevents' parameter. In
that
case, I was thinking that the wakeup on ep->wq might be helpful,
because
in the absence of subsequent events, thread B can now start
processing
the rest, instead of waiting for the next event to be queued.
However, I was thinking about the state of things before:
86c0517 fs/epoll: deal with wait_queue only once
Before that patch, thread A would have been removed from eq->wq
before
the wakeup call, thus waking up thread B. However, now that thread A
stays on the queue during the call to ep_send_events(), I believe the
wakeup would only target thread A, which doesn't help since its
already
checking for events. So given the state of things I think you are
right
in that its not needed. However, I wonder if not removing from the
ep->wq affects the multi-threaded case I described. Its been around
since 5.0, so probably not, but it would be a more subtle performance
difference.
Now I understand what you mean. You want to prevent "idling" of
events,
while thread A is busy with the small portion of events (portion is
equal
to maxevents). On next iteration thread A will pick up the rest, no
doubts,
but would be nice to give a chance to thread B immediately to deal
with the
rest. Ok, makes sense.
Exactly, I don't believe its racy as is - but it seems like it would be
good to wakeup other threads that may be waiting. That said, this logic
was removed as I pointed out. So I'm not sure we need to tie this
change
to the current one - but it may be a nice addition.
But what if to make this wakeup explicit if we have more events to
process?
(nothing is tested, just a guess)
@@ -255,6 +255,7 @@ struct ep_pqueue {
struct ep_send_events_data {
int maxevents;
struct epoll_event __user *events;
+ bool have_more;
int res;
};
@@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct
eventpoll *ep, struct list_head *head
}
static int ep_send_events(struct eventpoll *ep,
- struct epoll_event __user *events, int
maxevents)
+ struct epoll_event __user *events, int
maxevents,
+ bool *have_more)
{
- struct ep_send_events_data esed;
-
- esed.maxevents = maxevents;
- esed.events = events;
+ struct ep_send_events_data esed = {
+ .maxevents = maxevents,
+ .events = events,
+ };
ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
+ *have_more = esed.have_more;
+
return esed.res;
}
@@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct
epoll_event __user *events,
{
int res = 0, eavail, timed_out = 0;
u64 slack = 0;
- bool waiter = false;
+ bool waiter = false, have_more;
wait_queue_entry_t wait;
ktime_t expires, *to = NULL;
@@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct
epoll_event __user *events,
* more luck.
*/
if (!res && eavail &&
- !(res = ep_send_events(ep, events, maxevents)) &&
!timed_out)
+ !(res = ep_send_events(ep, events, maxevents, &have_more))
&&
+ !timed_out)
goto fetch_events;
if (waiter) {
@@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct
epoll_event __user *events,
__remove_wait_queue(&ep->wq, &wait);
spin_unlock_irq(&ep->wq.lock);
}
+ /*
+ * We were not able to process all the events, so immediately
+ * wakeup other waiter.
+ */
+ if (res > 0 && have_more && waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
return res;
}
Ok, yeah I like making it explicit. Looks like you are missing the
changes to ep_scan_ready_list(), but I think the general approach makes
sense.
Yeah, missed the hunk:
@@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct
eventpoll *ep, struct list_head *head
lockdep_assert_held(&ep->mtx);
list_for_each_entry_safe(epi, tmp, head, rdllink) {
- if (esed->res >= esed->maxevents)
+ if (esed->res >= esed->maxevents) {
+ esed->have_more = true;
break;
+ }
Although I really didn't have a test case that motivated this -
its just was sort of noting this change in behavior while reviewing the
current change.
PS. So what we decide with the original patch? Remove the whole
branch?
For fwiw, I'm ok removing the whole branch as you proposed.
Then probably Heiher could resend once more. Heiher, can you, please?
And I think the above change can go in separately (if we decide we want
it). I don't
think they need to be tied together. I also want to make sure this
change gets a full linux-next cycle, so I think it should target 5.5 at
this point.
Sure, this explicit ->wq wakeup is a separate patch, which should be
covered with some benchmarks. I can try to cook something out in order
to get numbers.
--
Roman