On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote: > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: > > > /* > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if > > > + * so desired. Notable LAZY workers will not wake the server and rely on the > > > + * server to do pickup whenever it naturally runs next. > > > > No, I never suggested we needed per-server runnable queues: in all my > > patchsets I had a single list of idle (runnable) workers. > > This is not about the idle servers.. > > So without the LAZY thing on, a previously blocked task hitting sys_exit > will enqueue itself on the runnable list and wake the server for pickup. How can a blocked task hit sys_exit()? Shouldn't it be RUNNING? Anyway, servers and workers are supposed to unregister before exiting, so if they call sys_exit() they break the agreement; in my patch I just clear all umcg-related state and proceed, without waking the server: the user broke the protocol, let them figure out what happened: +static void umcg_clear_task(struct task_struct *tsk) +{ + /* + * This is either called for the current task, or for a newly forked + * task that is not yet running, so we don't need strict atomicity + * below. + */ + if (tsk->umcg_task) { + WRITE_ONCE(tsk->umcg_task, NULL); + + /* These can be simple writes - see the comment above. */ + tsk->pinned_umcg_worker_page = NULL; + tsk->pinned_umcg_server_page = NULL; + tsk->flags &= ~PF_UMCG_WORKER; + } +} + +/* Called both by normally (unregister) and abnormally exiting workers. */ +void umcg_handle_exiting_worker(void) +{ + umcg_unpin_pages(); + umcg_clear_task(current); +} > > IIRC you didn't like the server waking while it was still running > another task, but instead preferred to have it pick up the newly > enqueued task when next it ran. Yes, this is the model I have, as I outlined in another email. I understand that having queues per-CPU/per-server is how it is done in the kernel, both for historical reasons (before multiprocessing there was a single queue/cpu) and for throughput (per-cpu runqueues are individually faster than a global one). However, this model is known to lag in presence of load spikes (long per-cpu queues with some CPUs idle), and is not really easy to work with given the use cases this whole userspace scheduling effort is trying to address: multiple priorities and work isolation: these are easy to address directly with a scheduler that has a global view rather than multiple per-cpu/per-server schedulers/queues that try to coordinate. I can even claim (without proof, just a hunch, based on how I would code this) that strict scheduling policies around priority and isolation (e.g. never run work item A if work item B becomes runnable, unless work item A is already running) cannot be enforced without a global scheduler, so per-cpu/per-server queues do not really fit the use case here... > > LAZY enables that.. *however* it does need to wake the server when it is > idle, otherwise they'll all sit there waiting for one another. If all servers are busy running workers, then it is not up to the kernel to "preempt" them in my model: the userspace can set up another thread/task to preempt a misbehaving worker, which will wake the server attached to it. But in practice there are always workers blocking in the kernel, which wakes their servers, which then reap the woken/runnable workers list, so well-behaving code does not need this. Yes, sometimes the code does not behave well, e.g. a worker grabs a spinlock, blocks in the kernel, its server runs another worker that starts spinning on the spinlock; but this is fixable by making the spinlock aware of our stuff: either the worker who got the lock is marked as LOCKED and so does not release its server (one of the reasons I have this flag), or the lock itself becomes sleepable (e.g. after spinning a bit it calls into a futex wait). And so we need to figure out this high-level thing first: do we go with the per-server worker queues/lists, or do we go with the approach I use in my patchset? It seems to me that the kernel-side code in my patchset is not more complicated than your patchset is shaping up to be, and some things are actually easier to accomplish, like having a single idle_server_ptr vs this LAZY and/or server "preemption" behavior that you have. Again, I'm OK with having it your way if all needed features are covered, but I think we should be explicit about why per-server/per-cpu model is chosen vs the one I proposed, especially as it seems the kernel side code is not really simpler in the end.