On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally.

Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.

> 20 years ago this was a real issue because we did not have:
>
>    - the folding of NEED_RESCHED into the preempt count
>
>    - the cacheline optimizations which make the preempt count cache
>      pretty much always cache hot
>
>    - the hardware was way less capable
>
> I'm not saying that preempt_count is completely free today as it
> obviously adds more text and affects branch predictors, but as the
> major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
> acceptable and tolerable tradeoff.

Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.

Is this just to try to emulate the existing PREEMPT_NONE behavior?

If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.

And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.

But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.

Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").

However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).

So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in other points.

But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.

So I think what you are saying is that we'd have the following
situation:

 - scheduling at "return to user space" is presumably always a good
   thing.

   A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
   whatever) would cover that, and would give us basically the existing
   CONFIG_PREEMPT_NONE behavior.

   So a config variable (either compile-time with PREEMPT_NONE or a
   dynamic one with DYNAMIC_PREEMPT set to none) would make any
   external wakeup only set that bit.

   And then a "fully preemptible low-latency desktop" would set the
   preempt-count bit too.

 - but the "timeslice over" case would always set the
   preempt-count-bit, regardless of any config, and would guarantee
   that we have reasonable latencies.

This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.

Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".

I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it).

But I thought I'd mention it.

Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.

                Linus
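
To make the two-bit policy discussed above a bit more concrete, here is
a rough userspace model of it. It is only a sketch and not kernel code:
every name in it (struct task_flags, request_resched(), the two enums)
is made up for illustration, and it assumes that the "preempt-count
bit" can be modeled as a plain boolean that is only acted on when the
preempt count is zero.

```c
#include <stdbool.h>

/* Hypothetical model; none of these names are real kernel APIs. */
enum preempt_model { MODEL_NONE, MODEL_FULL };
enum resched_reason { REASON_WAKEUP, REASON_TIMESLICE_OVER };

struct task_flags {
	bool need_resched;      /* stands in for the bit folded into the preempt count */
	bool need_resched_lazy; /* only looked at on return to user space */
};

/* Decide which bit(s) a reschedule request sets. */
void request_resched(struct task_flags *t, enum preempt_model model,
		     enum resched_reason why)
{
	if (why == REASON_TIMESLICE_OVER) {
		/* Timeslice over: always the preempt-count bit, whatever the config. */
		t->need_resched = true;
		return;
	}

	/* External wakeup: the lazy bit covers the return-to-user case. */
	t->need_resched_lazy = true;

	/* A fully preemptible setup sets the preempt-count bit too. */
	if (model == MODEL_FULL)
		t->need_resched = true;
}

/* In-kernel preemption is only possible when the preempt count is zero. */
bool should_preempt_in_kernel(const struct task_flags *t, int preempt_count)
{
	return t->need_resched && preempt_count == 0;
}

/* Return to user space honors either bit. */
bool should_schedule_on_user_return(const struct task_flags *t)
{
	return t->need_resched || t->need_resched_lazy;
}
```

In this model a wakeup under MODEL_NONE waits for the return to user
space, while an expired timeslice can preempt at the next point where
the preempt count drops to zero, which is the case that keeps latencies
bounded without any cond_resched() calls.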