Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes: > On Tue, Sep 19 2023 at 14:30, Thomas Gleixner wrote: >> On Mon, Sep 18 2023 at 18:57, Linus Torvalds wrote: >>> Then the question becomes whether we'd want to introduce a *new* >>> concept, which is a "if you are going to schedule, do it now rather >>> than later, because I'm taking a lock, and while it's a preemptible >>> lock, I'd rather not sleep while holding this resource". >>> >>> I suspect we want to avoid that for now, on the assumption that it's >>> hopefully not a problem in practice (the recently addressed problem >>> with might_sleep() was that it actively *moved* the scheduling point >>> to a bad place, not that scheduling could happen there, so instead of >>> optimizing scheduling, it actively pessimized it). But I thought I'd >>> mention it. >> >> I think we want to avoid that completely and if this becomes an issue, >> we rather be smart about it at the core level. >> >> It's trivial enough to have a per task counter which tells whether a >> preemtible lock is held (or about to be acquired) or not. Then the >> scheduler can take that hint into account and decide to grant a >> timeslice extension once in the expectation that the task leaves the >> lock held section soonish and either returns to user space or schedules >> out. It still can enforce it later on. >> >> We really want to let the scheduler decide and rather give it proper >> hints at the conceptual level instead of letting developers make random >> decisions which might work well for a particular use case and completely >> suck for the rest. I think we wasted enough time already on those. > > Finally I realized why cond_resched() & et al. are so disgusting. They > are scope-less and just a random spot which someone decided to be a good > place to reschedule. > > But in fact the really relevant measure is scope. Full preemption is > scope based: > > preempt_disable(); > do_stuff(); > preempt_enable(); > > which also nests properly: > > preempt_disable(); > do_stuff() > preempt_disable(); > do_other_stuff(); > preempt_enable(); > preempt_enable(); > > cond_resched() cannot nest and is obviously scope-less. That's true. Though, I would argue that another way to look at cond_resched() might be that it summarizes two kinds of state. First, the timer/resched activity that might cause you to schedule. The second, as an annotation from the programmer summarizing their understanding of the state of the execution stack and that there are no resources held across the current point. The second is, as you say, hard to get right -- because there's no clear definition of what it means for us to get it right, resulting in random placement of cond_rescheds() until latency improves. In any case this summary of execution state is done better by just always tracking preemption scope. > The TIF_ALLOW_RESCHED mechanism, which sparked this discussion only > pretends to be scoped. > > As Peter pointed out it does not properly nest with other mechanisms and > it cannot even nest in itself because it is boolean. > > The worst thing about it is that it is semantically reverse to the > established model of preempt_disable()/enable(), > i.e. allow_resched()/disallow_resched(). Can't disagree with that. In part it was that way because I was trying to provide an alternative to cond_resched() while executing in a particular preemptible scope -- except for not actually having any real notion of scoping. > > So instead of giving the scheduler a hint about 'this might be a good > place to preempt', providing proper scope would make way more sense: > > preempt_lazy_disable(); > do_stuff(); > preempt_lazy_enable(); > > That would be the obvious and semantically consistent counterpart to the > existing preemption control primitives with proper nesting support. > > might_sleep(), which is in all the lock acquire functions or your > variant of hint (resched better now before I take the lock) are the > wrong place. > > hint(); > lock(); > do_stuff(); > unlock(); > > hint() might schedule and when the task comes back schedule immediately > again because the lock is contended. hint() does again not have scope > and might be meaningless or even counterproductive if called in a deeper > callchain. Perhaps another problem is that some of these hints are useful for two different things: as an annotation about the state of execution, and also as a hint to the scheduler. For instance, this fix that Linus pointed to a few days ago: 4542057e18ca ("mm: avoid 'might_sleep()' in get_mmap_lock_carefully()"). is using might_sleep() in the first sense. Thanks -- ankur