Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>
>> I think we can keep these checks, but with this fixed definition of
>> resched_allowed(). This might be better:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>>
>>  static __always_inline bool resched_allowed(void)
>>  {
>> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> +       return unlikely(!preempt_count() &&
>> +                       test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>>  }
>
> I'm not convinced (at all) that the preempt count is the right thing.
>
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
>
> But what about any synchronous trap handling?
>
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).
>
> A page fault can *itself* say "feel free to preempt me", and that's one thing.
>
> But a page fault can also *interrupt* something that said "feel free to
> preempt me", and that's a completely *different* thing.
>
> So I feel like the "tsk_thread_flag" was sadly completely the wrong
> place to add this bit to, and the wrong place to test it in. What we
> really want is "current kernel entry context".

So, what we want allow_resched() to say is: feel free to reschedule if
we are in a reschedulable context.

The problem with doing that via an allow_resched tsk_thread_flag is
that the flag is really only valid while the task is executing in the
context in which it was set. And trying to validate the flag by
checking preempt_count() makes it pretty fragile, given that we are
then tying it to the specifics of whether the handling of arbitrary
interrupts bumps up the preempt_count() or not.

> So the right thing to do would basically be to put it in the stack
> frame at kernel entry - whether that kernel entry was a system call
> (which is doing some big copy that should be preemptible without us
> having to add "cond_resched()" in places), or is a page fault (which
> will also do things like big page clearings for hugepages)

Seems to me that associating an allow_resched flag with the stack has
a similar issue. Couldn't the context level change while we are on
the same stack?

I guess the problem is that allow_resched()/disallow_resched() really
need to demarcate a section of code having some property, but instead
they set up state that has a much wider scope.

Maybe code that allows resched could go in a new .section
".text.resched" or whatever, and we could use something like this as
the check:

  int resched_allowed(struct pt_regs *regs)
  {
          return !preempt_count() &&
                  in_resched_function(regs->rip);
  }

(allow_resched()/disallow_resched() shouldn't be needed except for
debug checks.)
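For concreteness, in_resched_function() could be built on
linker-provided section bounds, much like the existing
in_sched_functions() check. The __resched_text_start/__resched_text_end
symbol names below are made up and would need a linker-script change
to actually exist:

  /*
   * Sketch only: assumed bounds of the new ".text.resched" section,
   * exported by the linker script (analogous to __sched_text_start
   * and __sched_text_end used by in_sched_functions()).
   */
  extern char __resched_text_start[], __resched_text_end[];

  static inline bool in_resched_function(unsigned long addr)
  {
          return addr >= (unsigned long)__resched_text_start &&
                 addr <  (unsigned long)__resched_text_end;
  }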
We still need the !preempt_count() check, but now the two conditions
in the test express orthogonal ideas:

 - !preempt_count(): preemption is safe in the current context
 - in_resched_function(regs->rip): okay to reschedule here

So in this example, it should allow scheduling inside both of the
clear_page_reschedulable() calls:

  -> page_fault()
       clear_page_reschedulable();
          -> page_fault()
               clear_page_reschedulable();

Here though, rescheduling could happen only in the first call to
clear_page_reschedulable(), since the hardirq entry bumps up the
preempt_count() and so the inner call fails the !preempt_count()
check:

  -> page_fault()
       clear_page_reschedulable();
          -> hardirq()
               -> page_fault()
                    clear_page_reschedulable();

Does that make any sense, or am I just talking through my hat?
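As an aside, the marking itself would then be a build-time property
rather than runtime state. A rough sketch, assuming the section name
above (clear_page_reschedulable() here is just the made-up helper from
the traces):

  /*
   * Sketch only: place the function in ".text.resched" so that an
   * interrupt arriving while it runs sees regs->rip inside the
   * section.
   */
  static void __section(".text.resched")
  clear_page_reschedulable(struct page *page)
  {
          char *addr = page_address(page);
          unsigned long i;

          /*
           * The clearing loop is written out inline: the regs->rip
           * check only covers code that is literally inside
           * ".text.resched", so calling out to clear_page() or
           * memset() here would fall outside it.
           */
          for (i = 0; i < PAGE_SIZE; i++)
                  addr[i] = 0;
  }

--
ankur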