Hi,
(I had to remove many ccs; my mail server refused to send the message otherwise.)
On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
>
> Agreed,
>
>>> +	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>>  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>>
>> End result: you need to explain why the race cannot exist.
>
> Agreed,
>
>> And I think your patch is simply buggy
>
> Agreed again, but probably my reasoning was wrong,
>
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly
>> outside the pipe mutex, so the waiters and wakers aren't actually
>> serialized.
>
> This is what I don't understand... Could you spell it out?
>
> I _think_ that
>
> 	wait_event_whatever(WQ, CONDITION);
>
> vs
>
> 	CONDITION = 1;
> 	if (wq_has_sleeper(WQ))
> 		wake_up_xxx(WQ, ...);
>
> is fine.
This pattern is documented in wait.h:
https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96
Thus if there is an issue, then the documentation should be updated.
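
For reference, the pattern documented there is roughly the following (paraphrased from memory, "cond" stands for the real wakeup condition; see the link for the exact wording and the caveats around it):

	/* waker */
	cond = true;
	smp_mb();	/* explicit, or implied by wq_has_sleeper() */
	if (waitqueue_active(&wq))
		wake_up(&wq);

	/* waiter */
	for (;;) {
		prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
		if (cond)
			break;
		schedule();
	}
	finish_wait(&wq, &wait);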
> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)
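
If I read the current code correctly, the barriers in question are these (simplified sketch, not a verbatim copy of include/linux/wait.h and kernel/sched/wait.c):

	/* waker side: wq_has_sleeper() */
	smp_mb();				/* pairs with the barrier in set_current_state() */
	return !list_empty(&wq_head->head);	/* waitqueue_active() */

	/* waiter side: ___wait_event() -> prepare_to_wait_event(), simplified */
	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue(wq_head, &wq_entry->entry);
	set_current_state(state);		/* store + smp_mb() via smp_store_mb() */
	spin_unlock_irqrestore(&wq_head->lock, flags);
	/* ... followed by the CONDITION check in ___wait_event() */

So the waker's smp_mb() is supposed to pair with the smp_mb() implied by set_current_state() on the waiter side.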
Actually:
Does this work universally? I.e. can we add the optimization into
__wake_up()?
But I do not understand this comment (from 2.6.0)
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7
/*
 * Note: we use "set_current_state()" _after_ the wait-queue add,
 * because we need a memory barrier there on SMP, so that any
 * wake-function that tests for the wait-queue being active
 * will be guaranteed to see waitqueue addition _or_ subsequent
 * tests in this thread will see the wakeup having taken place.
 *
 * The spin_unlock() itself is semi-permeable and only protects
 * one way (it only protects stuff inside the critical region and
 * stops them from bleeding out - it would still allow subsequent
 * loads to move into the critical region).
 */
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}
set_current_state() now uses smp_store_mb(), which is a memory barrier
_after_ the store. Thus I do not see what enforces that the store
happens before the store for the __add_wait_queue().
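
To make that concrete, here is roughly what the waiter side does today, with the generic definitions substituted in (simplified from memory; x86, for example, implements smp_store_mb() as an xchg(), so the details differ per architecture):

	/* prepare_to_wait(), stores only, simplified */
	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, wait);		/* stores to q->head.next etc. */
	/* set_current_state(state) expands to smp_store_mb(current->__state, state): */
	WRITE_ONCE(current->__state, state);	/* plain store */
	smp_mb();				/* full barrier _after_ both stores */
	spin_unlock_irqrestore(&q->lock, flags);

The smp_mb() orders both stores before any later loads in this thread (e.g. the condition check), but by itself it says nothing about the order in which the two stores become visible to other CPUs.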
--
Manfred
From 7cb8f06ab022e7fc36bacbe65f654822147ec1aa Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Date: Fri, 27 Dec 2024 18:21:41 +0100
Subject: [PATCH] __wake_up_common_lock: Optimize for empty wake queues
wake_up() must wake up every task that is in the wait queue.
But: If there is concurrency between wake_up() and add_wait_queue(),
then (as long as we are not violating the memory ordering rules) we
can just claim that wake_up() happened "first" and add_wait_queue()
happened afterwards.
From a memory ordering perspective:
- If the wait queue is empty, then wake_up() just does
	spin_lock();
	list_empty();
	spin_unlock();
- add_wait_queue()/prepare_to_wait() do all kinds of operations,
  but they may become visible only when the spin_unlock_irqrestore()
  is done.
Thus, instead of pairing the memory barrier with the spinlock, and
thus writing to a potentially shared cacheline, we load-acquire
the next pointer from the list.
Risks and side effects:
- The memory barrier guaranteed by wake_up() is reduced to a load-acquire.
  Previously, there was always a spin_lock()/spin_unlock() pair.
- prepare_to_wait() actually does two operations under the spinlock:
  it adds current to the wait queue, and it calls set_current_state().
  The comment above prepare_to_wait() is not clear to me, so there
  might be further side effects.
Only lightly tested.
No benchmark test done.
---
kernel/sched/wait.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..10d02f652ab8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 	      int nr_exclusive, void *key)
 {
+	if (list_empty(&wq_head->head)) {
+		struct list_head *pn;
+
+		/*
+		 * pairs with spin_unlock_irqrestore(&wq_head->lock);
+		 * We actually do not need to acquire wq_head->lock, we just
+		 * need to be sure that there is no prepare_to_wait() that
+		 * completed on any CPU before __wake_up was called.
+		 * Thus instead of load_acquiring the spinlock and dropping
+		 * it again, we load_acquire the next list entry and check
+		 * that the list is not empty.
+		 */
+		pn = smp_load_acquire(&wq_head->head.next);
+
+		if (pn == &wq_head->head)
+			return 0;
+	}
 	return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
 }
 EXPORT_SYMBOL(__wake_up);
--
2.47.1