Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

Peter Oskolkov <posk@xxxxxxxxxx> · Wed, 15 Dec 2021 13:04:33 -0800

On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
> > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >  /*
> > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> > > + * so desired. Notable LAZY workers will not wake the server and rely on the
> > > + * server to do pickup whenever it naturally runs next.
> >
> > No, I never suggested we needed per-server runnable queues: in all my
> > patchsets I had a single list of idle (runnable) workers.
>
> This is not about the idle servers..
>
> So without the LAZY thing on, a previously blocked task hitting sys_exit
> will enqueue itself on the runnable list and wake the server for pickup.

How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?

Anyway, servers and workers are supposed to unregister before exiting,
so if they call sys_exit() they break the agreement; in my patch I
just clear all umcg-related state and proceed, without waking the
server: the user broke the protocol, let them figure out what
happened:

+static void umcg_clear_task(struct task_struct *tsk)
+{
+ /*
+ * This is either called for the current task, or for a newly forked
+ * task that is not yet running, so we don't need strict atomicity
+ * below.
+ */
+ if (tsk->umcg_task) {
+ WRITE_ONCE(tsk->umcg_task, NULL);
+
+ /* These can be simple writes - see the comment above. */
+ tsk->pinned_umcg_worker_page = NULL;
+ tsk->pinned_umcg_server_page = NULL;
+ tsk->flags &= ~PF_UMCG_WORKER;
+ }
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+ umcg_unpin_pages();
+ umcg_clear_task(current);
+}

>
> IIRC you didn't like the server waking while it was still running
> another task, but instead preferred to have it pick up the newly
> enqueued task when next it ran.

Yes, this is the model I have, as I outlined in another email. I
understand that having queues per-CPU/per-server is how it is done in
the kernel, both for historical reasons (before multiprocessing there
was a single queue/cpu) and for throughput (per-cpu runqueues are
individually faster than a global one). However, this model is known
to lag in presence of load spikes (long per-cpu queues with some CPUs
idle), and is not really easy to work with given the use cases this
whole userspace scheduling effort is trying to address: multiple
priorities and work isolation: these are easy to address directly with
a scheduler that has a global view rather than multiple
per-cpu/per-server schedulers/queues that try to coordinate.

I can even claim (without proof, just a hunch, based on how I would
code this) that strict scheduling policies around priority and
isolation (e.g. never run work item A if work item B becomes runnable,
unless work item A is already running) cannot be enforced without a
global scheduler, so per-cpu/per-server queues do not really fit the
use case here...

>
> LAZY enables that.. *however* it does need to wake the server when it is
> idle, otherwise they'll all sit there waiting for one another.

If all servers are busy running workers, then it is not up to the
kernel to "preempt" them in my model: the userspace can set up another
thread/task to preempt a misbehaving worker, which will wake the
server attached to it. But in practice there are always workers
blocking in the kernel, which wakes their servers, which then reap the
woken/runnable workers list, so well-behaving code does not need this.
Yes, sometimes the code does not behave well, e.g. a worker grabs a
spinlock, blocks in the kernel, its server runs another worker that
starts spinning on the spinlock; but this is fixable by making the
spinlock aware of our stuff: either the worker who got the lock is
marked as LOCKED and so does not release its server (one of the
reasons I have this flag), or the lock itself becomes sleepable (e.g.
after spinning a bit it calls into a futex wait).

And so we need to figure out this high-level thing first: do we go
with the per-server worker queues/lists, or do we go with the approach
I use in my patchset? It seems to me that the kernel-side code in my
patchset is not more complicated than your patchset is shaping up to
be, and some things are actually easier to accomplish, like having a
single idle_server_ptr vs this LAZY and/or server "preemption"
behavior that you have.

Again, I'm OK with having it your way if all needed features are
covered, but I think we should be explicit about why
per-server/per-cpu model is chosen vs the one I proposed, especially
as it seems the kernel side code is not really simpler in the end.