Hi,
(I had to remove many ccs; my mail server refused to send the message otherwise.)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
>
> Agreed,
>
>>> +	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>>  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>>
>> End result: you need to explain why the race cannot exist.
>
> Agreed,
>
>> And I think your patch is simply buggy
>
> Agreed again, but probably my reasoning was wrong,
>
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly
>> outside the pipe mutex, so the waiters and wakers aren't actually
>> serialized.
>
> This is what I don't understand... Could you spell it out?
>
> I _think_ that
>
> 	wait_event_whatever(WQ, CONDITION);
>
> vs
>
> 	CONDITION = 1;
> 	if (wq_has_sleeper(WQ))
> 		wake_up_xxx(WQ, ...);
>
> is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
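
For reference, the pattern documented there is roughly the following (paraphrased from memory, "cond" stands for the real wakeup condition; see the link for the exact wording and the caveats around it):

	/* waker */
	cond = true;
	smp_mb();	/* explicit, or implied by wq_has_sleeper() */
	if (waitqueue_active(&wq))
		wake_up(&wq);

	/* waiter */
	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (cond)
			break;
		schedule();
	}
	finish_wait(&wq, &wait);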
> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)
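
If I read the current code correctly, the barriers in question are these (simplified sketch, not a verbatim copy of include/linux/wait.h and kernel/sched/wait.c):

	/* waker side: wq_has_sleeper() */
	smp_mb();				/* pairs with the barrier in set_current_state() */
	return !list_empty(&wq_head->head);	/* waitqueue_active() */

	/* waiter side: ___wait_event() -> prepare_to_wait_event(), simplified */
	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue(wq_head, &wq_entry->entry);
	set_current_state(state);		/* store + smp_mb() via smp_store_mb() */
	spin_unlock_irqrestore(&wq_head->lock, flags);
	/* ... followed by the CONDITION check in ___wait_event() */

So the waker's smp_mb() is supposed to pair with the smp_mb() implied by set_current_state() on the waiter side.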
Actually:
Does this work universally? I.e. can we add the optimization into
__wake_up()?
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
/*
 * Note: we use "set_current_state()" _after_ the wait-queue add,
 * because we need a memory barrier there on SMP, so that any
 * wake-function that tests for the wait-queue being active
 * will be guaranteed to see waitqueue addition _or_ subsequent
 * tests in this thread will see the wakeup having taken place.
 *
 * The spin_unlock() itself is semi-permeable and only protects
 * one way (it only protects stuff inside the critical region and
 * stops them from bleeding out - it would still allow subsequent
 * loads to move into the critical region).
 */
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}
set_current_state() now uses smp_store_mb(), which is a memory barrier
_after_ the store. Thus I do not see what enforces that the store
happens before the store for the __add_wait_queue().
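
To make that concrete, here is roughly what the waiter side does today, with the generic definitions substituted in (simplified from memory; x86, for example, implements smp_store_mb() as an xchg(), so the details differ per architecture):

	/* prepare_to_wait(), stores only, simplified */
	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, wait);		/* stores to q->head.next etc. */
	/* set_current_state(state) expands to smp_store_mb(current->__state, state): */
	WRITE_ONCE(current->__state, state);	/* plain store */
	smp_mb();				/* full barrier _after_ both stores */
	spin_unlock_irqrestore(&q->lock, flags);

The smp_mb() orders both stores before any later loads in this thread (e.g. the condition check), but by itself it says nothing about the order in which the two stores become visible to other CPUs.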
--
Manfred
From 7cb8f06ab022e7fc36bacbe65f654822147ec1aa Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Date: Fri, 27 Dec 2024 18:21:41 +0100
Subject: [PATCH] __wake_up_common_lock: Optimize for empty wake queues
wake_up() must wake up every task that is in the wait queue.
But: If there is concurrency between wake_up() and add_wait_queue(),
then (as long as we are not violating the memory ordering rules) we
can just claim that wake_up() happened "first" and add_wait_queue()
happened afterwards.
From a memory ordering perspective:
- If the wait queue is empty, then wake_up() just does
	spin_lock();
	list_empty();
	spin_unlock();
- add_wait_queue()/prepare_to_wait() do all kinds of operations,
  but they may become visible only when the spin_unlock_irqrestore()
  is done.
Thus, instead of pairing the memory barrier with the spinlock, and
thus writing to a potentially shared cacheline, we load-acquire
the next pointer from the list.
Risks and side effects:
- The memory barrier guaranteed by wake_up() is reduced to a load-acquire.
  Previously, there was always a spin_lock()/spin_unlock() pair.
- prepare_to_wait() actually does two operations under the spinlock:
  it adds current to the wait queue, and it calls set_current_state().
  The comment above prepare_to_wait() is not clear to me, so there
  might be further side effects.
Only lightly tested.
No benchmark test done.
---
kernel/sched/wait.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..10d02f652ab8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 	      int nr_exclusive, void *key)
 {
+	if (list_empty(&wq_head->head)) {
+		struct list_head *pn;
+
+		/*
+		 * pairs with spin_unlock_irqrestore(&wq_head->lock);
+		 * We actually do not need to acquire wq_head->lock, we just
+		 * need to be sure that there is no prepare_to_wait() that
+		 * completed on any CPU before __wake_up was called.
+		 * Thus instead of load_acquiring the spinlock and dropping
+		 * it again, we load_acquire the next list entry and check
+		 * that the list is not empty.
+		 */
+		pn = smp_load_acquire(&wq_head->head.next);
+
+		if (pn == &wq_head->head)
+			return 0;
+	}
 	return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
 }
 EXPORT_SYMBOL(__wake_up);
--
2.47.1