On Wed, Dec 08, 2021 at 04:46:06PM +0100, Miklos Szeredi wrote:
On Tue, 7 Dec 2021 at 15:25, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
FIFO means the thread used longest ago gets to go first. If your threads
are idempotent workers, FIFO might not be the best option. But I'm
not very familiar with the FUSE code or its design.
Okay. Did some experiments, but couldn't see
wake_up_interruptible_sync() actually migrate the woken task; the
behavior was identical to wake_up_interruptible(). I guess this is
the "less" part in "more or less", but it would be good to see more
clearly what is happening.
I'll try to describe the design to give more context:
- FUSE is similar to network filesystem in that there's a server and a
client, except both are on the same host. The client lives in the
kernel and the server lives in userspace.
- Communication between them is done with read and write syscalls.
- Usually the server has multiple threads. When a server thread is
idle it is blocking in sys_read -> ... -> fuse_dev_do_read ->
wait_event_interruptible_exclusive(fiq->waitq,...).
- When a filesystem request comes in (e.g. mkdir) a request is
constructed, put on the input queue (fiq->pending) and fiq->waitq
woken up. After this the client task goes to sleep in
request_wait_answer -> wait_event_interruptible(req->waitq, ...).
- The server thread takes the request off the pending list, copies the
data to the userspace buffer and puts the request on the processing
list.
- The userspace part interprets the read buffer, performs the fs
operation, and writes the reply (see the sketch after this list).
- During the write(2) the reply is now copied to the kernel and the
request is looked up on the processing list. The client is woken up
through req->waitq. After returning from write(2) the server thread
again calls read(2) to get the next request.
- After being woken up, the client task now returns with the result of
the operation.
- The above example is for synchronous requests. There are async
requests like readahead or buffered writes. In that case the client
does not call request_wait_answer() but returns immediately and the
result is processed from the server thread using a callback function
(req->args->end()).
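
To make the above flow concrete, here is a rough sketch of a single server
thread's loop as seen from userspace (heavily simplified and only
illustrative; a real server would use libfuse and do proper opcode dispatch,
buffer sizing and error handling):

	#include <unistd.h>
	#include <linux/fuse.h>

	static void serve_loop(int fuse_fd)
	{
		char buf[FUSE_MIN_READ_BUFFER];

		for (;;) {
			/* blocks in fuse_dev_do_read() until a request is pending */
			ssize_t len = read(fuse_fd, buf, sizeof(buf));
			if (len <= 0)
				break;

			struct fuse_in_header *in = (struct fuse_in_header *) buf;

			/* ... interpret in->opcode, perform the fs operation,
			   build the reply payload ... */

			struct fuse_out_header out = {
				.len = sizeof(out),
				.error = 0,
				.unique = in->unique,	/* matches req on the processing list */
			};

			/* copies the reply to the kernel, which looks up the request
			   on the processing list and wakes the client via req->waitq */
			if (write(fuse_fd, &out, out.len) < 0)
				break;
		}
	}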
From a scheduling perspective it would be ideal if the server thread's
CPU matched the client thread's CPU, since that would keep the
data local, and for synchronous requests a _sync type wakeup is
perfect, since the client goes to sleep just as the server starts
processing and vice versa.
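
For illustration, on the queueing side the change would be something like
this (sketch only; the function below stands in for the real queueing code
in fs/fuse/dev.c, so take the name as approximate):

	/*
	 * Sketch: queue the request and do a "sync" wakeup, which tells the
	 * scheduler that the waker (the client) is about to sleep, so the
	 * woken server thread can be placed on the client's CPU.
	 */
	static void fuse_queue_and_wake(struct fuse_iqueue *fiq, struct fuse_req *req)
	{
		spin_lock(&fiq->lock);
		list_add_tail(&req->list, &fiq->pending);
		wake_up_interruptible_sync(&fiq->waitq);	/* instead of wake_up() */
		spin_unlock(&fiq->lock);
	}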
Always migrating the woken server thread to the client's CPU is not
going to be good, since this would result in too many migrations and
would lose locality for the server's stack.
Another idea is to add per-cpu input queues. The client then would
queue the request on the pending queue corresponding to its CPU and
wake up the server thread blocked on that queue.
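
Roughly like this (sketch only, none of these fields exist today):

	/* Sketch: per-CPU pending lists inside the input queue (not existing code) */
	struct fuse_iqueue_percpu {
		spinlock_t		lock;
		struct list_head	pending;
		wait_queue_head_t	waitq;	/* servers blocked on this CPU's queue */
	};

	static void fuse_queue_request_percpu(struct fuse_iqueue *fiq, struct fuse_req *req)
	{
		/* queue on the current (client's) CPU and wake a server blocked there */
		struct fuse_iqueue_percpu *q = get_cpu_ptr(fiq->percpu);

		spin_lock(&q->lock);
		list_add_tail(&req->list, &q->pending);
		wake_up_interruptible_sync(&q->waitq);
		spin_unlock(&q->lock);
		put_cpu_ptr(fiq->percpu);
	}

where fiq->percpu would be allocated with
alloc_percpu(struct fuse_iqueue_percpu).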
What happens though if this particular queue has no servers? Or if a
queue is starved because it's served by fewer threads than another?
Handling these cases seems really complicated.
Per-CPU input queue is a great idea. It was a key concept in ZUFS[1]
design. However, describing it as complicated would be an understatement,
to say the least.
Is there a simpler way?
Here is an idea: hint the userspace server on which cpu to execute the
_next_ request via the (unused) fuse_in_header.padding field in current
request. Naturally, this method would be effective only for cases where
queue-size > 1.
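
For example, a server thread could consume such a hint roughly like this
(illustrative only; using the padding field this way is the proposal, not
an existing ABI, and a real server would probably want something softer
than a hard affinity mask):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <linux/fuse.h>

	/* called after read(2) returned a request starting with *in */
	static void follow_cpu_hint(const struct fuse_in_header *in)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(in->padding, &set);	/* CPU hinted for the _next_ request */

		/* move this server thread next to the client before the next read(2) */
		sched_setaffinity(0, sizeof(set), &set);
	}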
Thanks,
Miklos
[1] https://lwn.net/Articles/795996/