On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
> > Yes, very much this. If you have a single-threaded workload which is
> > using the entirety of memory and would like to use even more, then it
> > makes sense to use as many CPUs as necessary getting memory out of its
> > way. If you have N CPUs and N-1 threads happily occupying themselves
> > in their own reasonably-sized working sets with one monster process
> > trying to use as much RAM as possible, then I'd be pretty unimpressed
> > to see the N-1 well-behaved threads preempted by kswapd.
>
> The default value provides one kswapd thread per NUMA node, the same
> as it was without the patch. Also, I would point out that just because
> you devote more threads to kswapd doesn't mean they are busy. If
> multiple kswapd threads are busy, they are almost certainly doing work
> that would otherwise have resulted in direct reclaims, which are often
> substantially more expensive than a couple of extra context switches
> due to preemption.
[...]
> In my previous response to Michal Hocko, I described how I think we
> could scale watermarks in response to direct reclaims, and launch more
> kswapd threads when kswapd peaks at 100% CPU usage.

I think you're missing my point about the workload ... kswapd isn't
"nice", so it will compete with the N-1 threads which are chugging along
at 100% CPU inside their working sets. In this scenario, we _don't_
want to kick off kswapd at all; we want the monster thread to clean up
its own mess. If we have idle CPUs, then yes, absolutely, let's have
them clean up for the monster, but otherwise I want my N-1 threads
doing their own thing.

Maybe we should renice kswapd anyway ... thoughts? We don't seem to
have had a nice'd kswapd since 2.6.12, but maybe we played with that
earlier and discovered it was a bad idea?
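
To make the renice idea concrete, here's an untested sketch against
kswapd_run() in mm/vmscan.c; the function body is from memory and the
nice value of 5 is an arbitrary placeholder. The point is only that a
positive nice value makes kswapd lose ties against busy user threads
instead of preempting them:

	int kswapd_run(int nid)
	{
		pg_data_t *pgdat = NODE_DATA(nid);
		int ret = 0;

		if (pgdat->kswapd)
			return 0;

		pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
		if (IS_ERR(pgdat->kswapd)) {
			/* failure at boot is fatal */
			BUG_ON(system_state < SYSTEM_RUNNING);
			pr_err("Failed to start kswapd on node %d\n", nid);
			ret = PTR_ERR(pgdat->kswapd);
			pgdat->kswapd = NULL;
			return ret;
		}

		/*
		 * Hypothetical: a positive nice value tells the scheduler
		 * to favour CPU-bound user threads over background reclaim
		 * when there are no idle CPUs to soak it up.
		 */
		set_user_nice(pgdat->kswapd, 5);
		return 0;
	}

Note that nice only matters under CPU contention: on an otherwise idle
machine kswapd would still get full throughput, so this shouldn't hurt
the single-threaded monster-process case above.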