Re: [POC][RFC][PATCH v2] sched: Extended Scheduler Time Slice

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> · Fri, 27 Oct 2023 12:35:56 -0400

On 2023-10-27 12:24, Steven Rostedt wrote:
On Fri, 27 Oct 2023 12:09:56 -0400
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

I need to clear one bit while seeing if another bit is set. I could also
use subl, as that would also atomically clear the bit.

Ah ok, I did not get that you needed to test for a different bit than
the one you clear.

Yeah, maybe I'm not articulating the implementation well.

   bit 0: Set by user space to tell the kernel it's in a critical section

   bit 1: Set by the kernel to tell user space it was given an extended time slice

Bit 1 will only be set by the kernel if bit 0 is set.

When entering a critical section, user space will set bit 0. When it leaves
the critical section, it needs to clear bit 0, but also needs to handle the
race where the kernel preempts it and sets bit 1 just as it is clearing
bit 0. Thus it needs an atomic operation to clear bit 0
without affecting bit 1. Once bit 0 is cleared, it does not need to worry
about bit 1 being set after that as the kernel will only set bit 1 if it
sees that bit 0 was set. After user space clears bit 0, it must check bit 1
to see if it should now schedule. And it's also up to user space to clear
bit 1, but it can do that at any time with bit 0 cleared.

  extend() {
	cr_flags = 1;
  }

  unextend() {
	cr_flags &= ~1;  /* Must be atomic */
	if (cr_flags & 2) {
		cr_flags = 0;  /* May not be necessary as it gets cleared by extend() */
		sched_yield();
	}
  }
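
The bit protocol above can be sketched in user-space C with C11 atomics.
This is only an illustration under stated assumptions: cr_flags here is an
ordinary process-local variable rather than a word shared with the kernel,
fake_kernel_try_extend() and fake_sched_yield() are hypothetical stand-ins
for the kernel side and for sched_yield(), and the default seq_cst atomics
may be stronger than what the thread actually requires. The point it shows
is that a fetch-and clears bit 0 and returns the prior value in one step,
so bit 1 can be tested without a second racy read.

```c
#include <assert.h>
#include <stdatomic.h>

#define CR_IN_CRITICAL 1u	/* bit 0: set by user space */
#define CR_EXTENDED    2u	/* bit 1: set by the kernel */

/* Hypothetical stand-ins: the real word would be shared with the kernel. */
static _Atomic unsigned int cr_flags;
static int yield_count;

/* Placeholder for sched_yield() so the sketch can be checked in isolation. */
static void fake_sched_yield(void)
{
	yield_count++;
}

/* Simulated kernel side: grant an extension only if bit 0 is set. */
static int fake_kernel_try_extend(void)
{
	if (!(atomic_load(&cr_flags) & CR_IN_CRITICAL))
		return 0;
	atomic_fetch_or(&cr_flags, CR_EXTENDED);
	return 1;
}

static void extend(void)
{
	atomic_store(&cr_flags, CR_IN_CRITICAL);
}

static void unextend(void)
{
	/* Clear bit 0 and capture the previous value in one atomic step. */
	unsigned int prev = atomic_fetch_and(&cr_flags, ~CR_IN_CRITICAL);

	if (prev & CR_EXTENDED) {
		atomic_store(&cr_flags, 0);
		fake_sched_yield();
	}
}

/* Usage: an extension granted inside the section forces one yield on exit. */
static void demo(void)
{
	extend();
	assert(fake_kernel_try_extend() == 1);
	unextend();
	assert(yield_count == 1);
}
```

Once bit 0 is clear, fake_kernel_try_extend() refuses to set bit 1, which
is the property that makes the single fetch-and sufficient.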

Does that make more sense?

Not really.

Please see my other email about the need for a reference count here, for
nested locks use-cases.

By "atomic" operation I suspect you only mean "single instruction" which can
alter the state of the field and keep its prior content in a register, not a
lock-prefixed atomic operation, right ?

The only reason you have this asm trickiness is that both states
are placed into different bits of the same word, which is just an
optimization. You could achieve the same much more simply by splitting
this state into two different words, e.g.:

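extend() {
  /* Enter one more nesting level of critical section. */
  WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest + 1);
  barrier();
}

unextend() {
  barrier();
  /* Leave one nesting level. */
  WRITE_ONCE(__rseq_abi->cr_nest, __rseq_abi->cr_nest - 1);
  /* The kernel asked us to yield while we were in the section. */
  if (READ_ONCE(__rseq_abi->must_yield)) {
    WRITE_ONCE(__rseq_abi->must_yield, 0);
    sched_yield();
  }
}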
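
To make the two-word variant above concrete, here is a minimal user-space
sketch that compiles on its own. Everything in it is an assumption made for
illustration: struct fake_rseq_abi is a hypothetical layout (cr_nest and
must_yield are the fields proposed in this message, not part of the
existing rseq ABI), the barrier()/READ_ONCE()/WRITE_ONCE() definitions are
simplified GCC/Clang approximations of the kernel macros, and
nest_fake_yield() stands in for sched_yield(). It mirrors the code above
and shows that plain single-copy stores suffice once each word has a
single writer.

```c
#include <assert.h>

/* Simplified approximations of the kernel macros (GCC/Clang assumed). */
#define barrier()		__asm__ __volatile__("" ::: "memory")
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical layout: these fields are not in the real rseq ABI. */
struct fake_rseq_abi {
	unsigned int cr_nest;		/* written by user space only */
	unsigned int must_yield;	/* set by the kernel, cleared by user space */
};

static struct fake_rseq_abi fake_abi;
static int nest_yield_count;

/* Placeholder for sched_yield() so the sketch can be checked in isolation. */
static void nest_fake_yield(void)
{
	nest_yield_count++;
}

static void nest_extend(void)
{
	WRITE_ONCE(fake_abi.cr_nest, fake_abi.cr_nest + 1);
	barrier();	/* keep the critical section after the increment */
}

static void nest_unextend(void)
{
	barrier();	/* keep the critical section before the decrement */
	assert(fake_abi.cr_nest > 0);	/* unbalanced unextend() is a caller bug */
	WRITE_ONCE(fake_abi.cr_nest, fake_abi.cr_nest - 1);
	if (READ_ONCE(fake_abi.must_yield)) {
		WRITE_ONCE(fake_abi.must_yield, 0);
		nest_fake_yield();
	}
}
```

Because cr_nest is a counter rather than a bit, nested locks simply
increment and decrement it, and no read-modify-write instruction is needed
on a word the kernel also writes.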

Or am I missing something ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com