On Tue, Oct 16, 2018 at 07:13:48PM +0200, Thomas Gleixner wrote:
> On Tue, 16 Oct 2018, Frederic Weisbecker wrote:
> 
> > On Mon, Oct 15, 2018 at 10:50:48AM +0100, Mel Gorman wrote:
> > > On Fri, Oct 12, 2018 at 09:21:41AM +0200, Vlastimil Babka wrote:
> > > > On 9/14/18 4:59 PM, Sebastian Andrzej Siewior wrote:
> > > > I think this evaluation is missing the other side of the story, and
> > > > that's the cost of using a spinlock (even uncontended) instead of
> > > > disabling preemption. The expectation for LRU pagevec is that the local
> > > > operations will be much more common than draining of other CPU's, so
> > > > it's optimized for the former.
> > > > 
> > > Agreed, the drain operation should be extremely rare except under heavy
> > > memory pressure, particularly if mixed with THP allocations. The overall
> > > intent seems to be improving lockdep coverage but I don't think we
> > > should take a hit in the common case just to get that coverage. Bear in
> > > mind that the main point of the pagevec (whether it's true or not) is to
> > > avoid the much heavier LRU lock.
> > 
> > So indeed, if the only purpose of this patch were to make lockdep wiser,
> > a pair of spin_lock_acquire() / spin_unlock_release() would be enough to
> > teach it and would avoid the overhead.
> > 
> > Now another significant incentive behind this change is to improve CPU isolation.
> > Workloads relying on owning the entire CPU without being disturbed are interested
> > in this as it allows to offload some noise. It's no big deal for those who can
> > tolerate rare events but often CPU isolation is combined with deterministic latency
> > requirements.
> > 
> > So, I'm not saying this per-CPU spinlock is necessarily the right answer, I
> > don't know that code enough to have an opinion, but I still wish we can find
> > a solution.
> 
> One way to solve this and I had played with it already is to make the smp
> function call based variant and the lock based variant switchable at boot
> time with a static key. That way CPU isolation can select it and take the
> penalty while normal workloads are not affected.

Sounds good, and we can toggle that with "isolcpus=". We could also make it
modifiable through cpuset.sched_load_balance; despite its name, that file
isn't just used to govern scheduler load balancing but CPU isolation in
general. Making it toggleable at runtime could be a bit trickier in terms of
correctness, yet it's probably needed in the long term.
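
To make that concrete, here is a rough (untested) sketch of how the
static-key switch could look. All the identifiers below are made up for
illustration: remote_pagevec_drain, pagevec_drain_lock, lru_drain_work and
drain_cpu_pagevecs() just stand in for the real pagevec machinery in
mm/swap.c. Only the static-key and __setup() plumbing is the stock API, and
in practice the key would be flipped from the isolcpus= handling rather than
a dedicated parameter:

/*
 * Rough sketch only: pagevec_drain_lock, lru_drain_work and
 * drain_cpu_pagevecs() are made-up placeholders for the real pagevec
 * machinery in mm/swap.c. Per-CPU spin_lock_init()/INIT_WORK() is omitted.
 */
#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Off by default: normal workloads keep the existing work-based drain. */
static DEFINE_STATIC_KEY_FALSE(remote_pagevec_drain);

static DEFINE_PER_CPU(spinlock_t, pagevec_drain_lock);
static DEFINE_PER_CPU(struct work_struct, lru_drain_work);

/* Stand-in for the real per-CPU pagevec drain. */
static void drain_cpu_pagevecs(int cpu)
{
}

/*
 * Hypothetical boot parameter; in practice this would be driven by the
 * isolcpus= / housekeeping setup instead.
 */
static int __init setup_remote_pagevec_drain(char *str)
{
	static_branch_enable(&remote_pagevec_drain);
	return 1;
}
__setup("remote_pagevec_drain", setup_remote_pagevec_drain);

static void drain_pagevec_on(int cpu)
{
	if (static_branch_unlikely(&remote_pagevec_drain)) {
		/*
		 * Lock-based variant: take the remote CPU's pagevec lock and
		 * drain it from here, so the isolated CPU is not disturbed.
		 */
		spinlock_t *lock = per_cpu_ptr(&pagevec_drain_lock, cpu);

		spin_lock(lock);
		drain_cpu_pagevecs(cpu);
		spin_unlock(lock);
	} else {
		/*
		 * Default variant: queue the drain on the target CPU itself,
		 * keeping the common local pagevec operations cheap there.
		 */
		schedule_work_on(cpu, per_cpu_ptr(&lru_drain_work, cpu));
	}
}

The nice property is that static_branch_unlikely() compiles down to a
patched no-op branch for the disabled case, so workloads that never ask for
isolation keep the cheap local path, which matches your point about not
penalizing normal workloads.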