Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:

> On Tue, Sep 12 2023 at 10:26, Peter Zijlstra wrote:
>> On Mon, Sep 11, 2023 at 10:04:17AM -0700, Ankur Arora wrote:
>>> > The problem with the REP prefix (and Xen hypercalls) is that
>>> > they're long running instructions and it becomes fundamentally
>>> > impossible to put a cond_resched() in.
>>> >
>>> >> Yes. I'm starting to think that the only sane solution is to
>>> >> limit cases that can do this a lot, and the "instruction pointer
>>> >> region" approach would certainly work.
>>> >
>>> > From a code locality / I-cache POV, I think a sorted list of
>>> > (non overlapping) ranges might be best.
>>>
>>> Yeah, agreed. There are a few problems with doing that though.
>>>
>>> I was thinking of using a check of this kind to schedule out when
>>> it is executing in this "reschedulable" section:
>>>
>>>     !preempt_count() && in_resched_function(regs->rip);
>>>
>>> For preemption=full, this should mostly work.
>>> For preemption=voluntary, though, this'll only work with out-of-line
>>> locks, not if the lock is inlined.
>>>
>>> (Both should have problems with __this_cpu_* and the like, but
>>> maybe we can handwave that away with sparse/objtool etc.)
>>
>> So one thing we can do is combine the TIF_ALLOW_RESCHED with the ranges
>> thing, and then only search the range when the TIF flag is set.
>>
>> And I'm thinking it might be a good idea to have objtool validate that
>> the range only contains simple instructions; the moment it contains
>> control flow, I'm thinking it's too complicated.
>
> Can we take a step back and look at the problem from a scheduling
> perspective?
>
> The basic operation of a non-preemptible kernel is time slice
> scheduling, which means that a task can run more or less undisturbed
> for a full time slice once it gets on the CPU, unless it schedules away
> voluntarily via a blocking operation.
>
> This works pretty well as long as everything runs in userspace, as the
> preemption points in the return to user space path are independent of
> the preemption model.
>
> These preemption points handle both time slice exhaustion and priority
> based preemption.
>
> With PREEMPT=NONE these are the only available preemption points.
>
> That means that kernel code can run more or less indefinitely until it
> schedules out or returns to user space, which is obviously not possible
> for kernel threads.
>
> To prevent starvation the kernel gained voluntary preemption points,
> i.e. cond_resched(), which has to be added manually to code as a
> developer sees fit.
>
> Later we added PREEMPT=VOLUNTARY, which utilizes might_resched() as
> additional preemption points. might_resched() utilizes the existing
> might_sleep() debug points, which are in code paths that might block on
> a contended resource. These debug points are mostly in core and
> infrastructure code, in code paths which can block anyway. The only
> difference is that they allow preemption even when the resource is
> uncontended.
>
> Additionally we have PREEMPT=FULL, which utilizes every zero transition
> of preempt_count as a potential preemption point.
>
> Now we have the situation of long running data copy or data clear
> operations which run fully in hardware, but can be interrupted. As the
> interrupt return to kernel mode does not preempt in the NONE and
> VOLUNTARY cases, new workarounds emerged. Mostly by defining a data
> chunk size and adding cond_resched() again.
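[ Aside for readers of the archive: the chunking workaround described
  above looks roughly like the sketch below. This is a generic
  illustration of the pattern only; CHUNK_SIZE and clear_region() are
  made-up names, not any specific kernel site. ]

    #define CHUNK_SIZE  (64 * 1024)         /* heuristic chunk size */

    static void clear_region(void *addr, size_t len)
    {
            while (len) {
                    size_t chunk = min(len, (size_t)CHUNK_SIZE);

                    memset(addr, 0, chunk); /* the long running operation */
                    addr += chunk;
                    len -= chunk;

                    cond_resched();         /* manually placed preemption point */
            }
    }

[ The chunk size is pure heuristic, and for an operation that runs
  entirely in hardware there is nowhere to put the cond_resched() at
  all, which is exactly the problem under discussion. ]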
>
> That's ugly and does not work for long lasting hardware operations, so
> we ended up with the suggestion of TIF_ALLOW_RESCHED to work around
> that. But again this needs to be manually annotated, in the same way as
> an IP range based preemption scheme requires annotation.
>
> TBH. I detest all of this.
>
> Both cond_resched() and might_sleep/sched() are completely random
> mechanisms as seen from time slice operation, and the data chunk based
> mechanism is just heuristics which works as well as heuristics tend to
> work. allow_resched() is not any different, and IP based preemption
> mechanisms are not going to be any better.

Agreed. I was looking at how to add resched sections etc., and in
addition to the randomness, the choice of where exactly to add them
seemed quite fuzzy. A recipe for future cruft.

> The approach here is: prevent the scheduler from making decisions, and
> then mitigate the fallout with heuristics.
>
> That's just backwards as it moves resource control out of the scheduler
> into random code which has absolutely no business doing resource
> control.
>
> We have observed the reverse issue in PREEMPT_RT. The fact that
> spinlock held sections became preemptible caused even more preemption
> activity than on a PREEMPT=FULL kernel. The worst side effect of that
> was extensive lock contention.
>
> The way we addressed that was to add a lazy preemption mode, which
> tries to preserve the PREEMPT=FULL behaviour when the scheduler wants
> to preempt tasks which all belong to the SCHED_OTHER scheduling class.
> This works pretty well and gains back a massive amount of performance
> for the non-realtime throughput oriented tasks without affecting the
> schedulability of real-time tasks at all. IOW, it does not take control
> away from the scheduler. It cooperates with the scheduler and leaves
> the ultimate decisions to it.
>
> I think we can do something similar for the problem at hand, which
> avoids most of these heuristic horrors and control boundary violations.
>
> The main issue is that long running operations do not honour the time
> slice, and we work around that with cond_resched() and now have ideas
> with this new TIF bit and IP ranges.
>
> None of that is really well defined with respect to time slices. In
> fact it's not defined at all versus any aspect of scheduling behaviour.
>
> What about the following:
>
> 1) Keep preemption count and the real preemption points enabled
>    unconditionally. That's not more overhead than the current
>    PREEMPT_DYNAMIC mechanism as long as the preemption count does not
>    go to zero, i.e. the folded NEED_RESCHED bit stays set.
>
>    From earlier experiments I know that the overhead of preempt_count
>    is minimal and only really observable with micro benchmarks.
>    Otherwise it ends up in the noise as long as the slow path is not
>    taken.
>
>    I did a quick check comparing a plain inc/dec pair vs. the
>    PREEMPT_DYNAMIC inc/dec_and_test+NOOP mechanism and the delta is
>    in the non-conclusive noise.
>
>    20 years ago this was a real issue because we did not have:
>
>      - the folding of NEED_RESCHED into the preempt count
>
>      - the cacheline optimizations which keep the preempt count
>        pretty much always cache hot
>
>      - hardware anywhere near as capable as today's
>
>    I'm not saying that preempt_count is completely free today, as it
>    obviously adds more text and affects branch predictors, but as the
>    major distros ship with PREEMPT_DYNAMIC enabled it is obviously an
>    acceptable and tolerable tradeoff.
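[ For reference, the "folding" mentioned above: on x86, NEED_RESCHED is
  mirrored, inverted, into the top bit of the per-CPU preempt count (see
  arch/x86/include/asm/preempt.h), so the preempt_enable() fast path is
  a single decrement-and-test that covers both "count reached zero" and
  "reschedule requested" at once. A minimal sketch of that invariant,
  with made-up helper names, not the actual kernel code: ]

    #define PREEMPT_NEED_RESCHED  0x80000000u  /* inverted: bit clear => resched requested */

    /* The count starts out as PREEMPT_NEED_RESCHED: bit set, count zero. */

    /* Slow-path placeholder; stands in for preempt_schedule(). */
    extern void sketch_preempt_schedule(void);

    /* Scheduler requests preemption: fold NEED_RESCHED in by clearing the bit. */
    static inline void sketch_fold_need_resched(unsigned int *pc)
    {
            *pc &= ~PREEMPT_NEED_RESCHED;
    }

    /* preempt_enable() fast path: one dec-and-test. */
    static inline void sketch_preempt_enable(unsigned int *pc)
    {
            if (--(*pc) == 0)   /* zero only if count == 0 AND bit folded in */
                    sketch_preempt_schedule();
    }

[ Under point 2) below, NEED_RESCHED_LAZY would simply never be folded
  in, so this fast path never hits zero and the task runs on until the
  return to user space checks. ]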
>
> 2) When the scheduler wants to set NEED_RESCHED, it instead sets
>    NEED_RESCHED_LAZY, which is only evaluated in the return to user
>    space preemption points.
>
>    As NEED_RESCHED_LAZY is not folded into the preemption count, the
>    preemption count won't become zero, so the task can continue until
>    it hits return to user space.
>
>    That preserves the existing behaviour.
>
> 3) When the scheduler tick observes that the time slice is exhausted,
>    then it folds the NEED_RESCHED bit into the preempt count, which
>    causes the real preemption points to actually preempt, including
>    the return from interrupt to kernel path.

Right, and currently we check cond_resched() all the time in the
expectation that something might need a resched. Folding it in, with the
scheduler determining when the next preemption happens, seems to make a
lot of sense to me.

Thanks
Ankur

> That even allows the scheduler to enforce preemption for e.g. RT
> class tasks without changing anything else.
>
> I'm pretty sure that this gets rid of cond_resched(), which is an
> impressive list of instances:
>
>      ./drivers    392
>      ./fs         318
>      ./mm         189
>      ./kernel     184
>      ./arch        95
>      ./net         83
>      ./include     46
>      ./lib         36
>      ./crypto      16
>      ./sound       16
>      ./block       11
>      ./io_uring    13
>      ./security    11
>      ./ipc          3
>
> That list clearly documents that the majority of these cond_resched()
> invocations is in code which neither should care about nor have any
> influence on the core scheduling decision machinery.
>
> I think it's worth a try as it just fits into the existing preemption
> scheme, solves the issue of long running kernel functions, prevents
> invalid preemption and can utilize the existing instrumentation and
> debug infrastructure.
>
> Most importantly it gives control back to the scheduler and does not
> make it depend on the mercy of cond_resched(), allow_resched() or
> whatever heuristics are sprinkled all over the kernel.
>
> To me this makes a lot of sense, but I might be on the completely wrong
> track. So feel free to tell me that I'm completely nuts and/or just not
> seeing the obvious.
>
> Thanks,
>
>         tglx

--
ankur