Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 9 Sept 2023 at 20:49, Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>
>> I think we can keep these checks, but with this fixed definition of
>> resched_allowed(). This might be better:
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2260,7 +2260,8 @@ static inline void disallow_resched(void)
>>
>>  static __always_inline bool resched_allowed(void)
>>  {
>> -       return unlikely(test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>> +       return unlikely(!preempt_count() &&
>> +                       test_tsk_thread_flag(current, TIF_RESCHED_ALLOW));
>>  }
>
> I'm not convinced (at all) that the preempt count is the right thing.
>
> It works for interrupts, yes, because interrupts will increment the
> preempt count even on non-preempt kernels (since the preempt count is
> also the interrupt context level).
>
> But what about any synchronous trap handling?
>
> In other words, just something like a page fault? A page fault doesn't
> increment the preemption count (and in fact many page faults _will_
> obviously re-schedule as part of waiting for IO).
>
> A page fault can *itself* say "feel free to preempt me", and that's one thing.
>
> But a page fault can also *interrupt* something that said "feel free to
> preempt me", and that's a completely *different* thing.
>
> So I feel like the "tsk_thread_flag" was sadly completely the wrong
> place to add this bit to, and the wrong place to test it in. What we
> really want is "current kernel entry context".

So, what we want allow_resched() to say is: feel free to reschedule if
we are in a reschedulable context.

The problem with doing that via an allow_resched tsk_thread_flag is
that the flag is really only valid while the task is executing in the
context in which it was set. And trying to validate the flag by
checking preempt_count() makes it pretty fragile, given that we are
then tying it to the specifics of whether the handling of arbitrary
interrupts bumps up the preempt_count() or not.

> So the right thing to do would basically be to put it in the stack
> frame at kernel entry - whether that kernel entry was a system call
> (which is doing some big copy that should be preemptible without us
> having to add "cond_resched()" in places), or is a page fault (which
> will also do things like big page clearings for hugepages)

Seems to me that associating an allow_resched flag with the stack has
a similar issue. Couldn't the context level change while we are on
the same stack?

I guess the problem is that allow_resched()/disallow_resched() really
need to demarcate a section of code having some property, but instead
they set up state that has a much wider scope.

Maybe code that allows resched could go in a new .section
".text.resched" or whatever, and we could use something like this as
the check:

  int resched_allowed(struct pt_regs *regs)
  {
          return !preempt_count() &&
                  in_resched_function(regs->rip);
  }

(allow_resched()/disallow_resched() shouldn't be needed except for
debug checks.)
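For concreteness, in_resched_function() could be built on
linker-provided section bounds, much like the existing
in_sched_functions() check. The __resched_text_start/__resched_text_end
symbol names below are made up and would need a linker-script change
to actually exist:

  /*
   * Sketch only: assumed bounds of the new ".text.resched" section,
   * exported by the linker script (analogous to __sched_text_start
   * and __sched_text_end used by in_sched_functions()).
   */
  extern char __resched_text_start[], __resched_text_end[];

  static inline bool in_resched_function(unsigned long addr)
  {
          return addr >= (unsigned long)__resched_text_start &&
                 addr <  (unsigned long)__resched_text_end;
  }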
We still need the !preempt_count() check, but now the two conditions
in the test express orthogonal ideas:

 - !preempt_count(): preemption is safe in the current context
 - in_resched_function(regs->rip): okay to reschedule here

So in this example, it should allow scheduling inside both of the
clear_page_reschedulable() calls:

  -> page_fault()
       clear_page_reschedulable();
          -> page_fault()
               clear_page_reschedulable();

Here though, rescheduling could happen only in the first call to
clear_page_reschedulable(), since the hardirq entry bumps up the
preempt_count() and so the inner call fails the !preempt_count()
check:

  -> page_fault()
       clear_page_reschedulable();
          -> hardirq()
               -> page_fault()
                    clear_page_reschedulable();

Does that make any sense, or am I just talking through my hat?
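As an aside, the marking itself would then be a build-time property
rather than runtime state. A rough sketch, assuming the section name
above (clear_page_reschedulable() here is just the made-up helper from
the traces):

  /*
   * Sketch only: place the function in ".text.resched" so that an
   * interrupt arriving while it runs sees regs->rip inside the
   * section.
   */
  static void __section(".text.resched")
  clear_page_reschedulable(struct page *page)
  {
          char *addr = page_address(page);
          unsigned long i;

          /*
           * The clearing loop is written out inline: the regs->rip
           * check only covers code that is literally inside
           * ".text.resched", so calling out to clear_page() or
           * memset() here would fall outside it.
           */
          for (i = 0; i < PAGE_SIZE; i++)
                  addr[i] = 0;
  }

--
ankur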