> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this. If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way. If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>>
>> The default value provides one kswapd thread per NUMA node, the same
>> as it was without the patch. Also, I would point out that just because
>> you devote more threads to kswapd, that doesn't mean they are busy. If
>> multiple kswapd threads are busy, they are almost certainly doing work
>> that would have resulted in direct reclaims, which are often substantially
>> more expensive than a couple of extra context switches due to preemption.
>
> [...]
>
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets. In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess. If we have idle CPUs, then yes, absolutely, let's have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?

Trying to distinguish between the monster and a high-value task that you
want to run as quickly as possible would be challenging. I like your idea
of using renice. It probably makes sense to continue running the first
kswapd thread on each node at the standard nice value, and to run each
additional thread at a positive nice value.
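
To make that concrete, here is a rough sketch (not the actual patch) of
what kswapd_run() could look like if the extra per-node workers were
deprioritised at creation time. kswapd_threads_per_node and the
pgdat->kswapd[] array are assumed names used purely for illustration;
kthread_create_on_node(), set_user_nice() and wake_up_process() are
existing kernel interfaces, and the nice value of 5 is just a placeholder.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/mmzone.h>
#include <linux/err.h>
#include <linux/printk.h>

static int kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	struct task_struct *t;
	int i;

	for (i = 0; i < kswapd_threads_per_node; i++) {
		if (pgdat->kswapd[i])
			continue;

		t = kthread_create_on_node(kswapd, pgdat, nid,
					   "kswapd%d:%d", nid, i);
		if (IS_ERR(t)) {
			pr_err("Failed to start kswapd%d:%d\n", nid, i);
			return PTR_ERR(t);
		}

		/*
		 * The first worker keeps the historical behaviour
		 * (nice 0). Any additional workers run at a positive
		 * nice value so they only consume CPU that well-behaved
		 * tasks are not already using.
		 */
		if (i > 0)
			set_user_nice(t, 5);

		pgdat->kswapd[i] = t;
		wake_up_process(t);
	}
	return 0;
}

That way the default configuration behaves exactly as it does today, and
only the opt-in extra threads yield to busy userspace tasks.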