On Fri, Oct 04, 2019 at 06:45:21AM -0700, Daniel Colascione wrote:
> On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai <cai@xxxxxx> wrote:
> > > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione <dancol@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > Adding the correct linux-mm address.
> > > > > >
> > > > > >
> > > > > >> +config SPLIT_RSS_COUNTING
> > > > > >> +	bool "Per-thread mm counter caching"
> > > > > >> +	depends on MMU
> > > > > >> +	default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > > > >> +	help
> > > > > >> +	  Cache mm counter updates in thread structures and
> > > > > >> +	  flush them to visible per-process statistics in batches.
> > > > > >> +	  Say Y here to slightly reduce cache contention in processes
> > > > > >> +	  with many threads at the expense of decreasing the accuracy
> > > > > >> +	  of memory statistics in /proc.
> > > > > >> +
> > > > > >> endmenu
> > > > >
> > > > > All those vague words are going to make developers almost
> > > > > impossible to decide the right selection here. It sounds like we
> > > > > should kill SPLIT_RSS_COUNTING at all to simplify the code as
> > > > > the benefit is so small vs the side-effect?
> > > >
> > > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > > > and a basic desktop, it doesn't make a difference. I figured making it
> > > > a knob would help allay concerns about the performance impact in more
> > > > extreme configurations.
> > >
> > > I do agree with Qian. Either it is really helpful (is it? probably on
> > > the number of cpus) and it should be auto-enabled or it should be
> > > dropped altogether. You cannot really expect people know how to enable
> > > this without a deep understanding of the MM internals. Not to mention
> > > all those users using distro kernels/configs.
> > >
> > > A config option sounds like a bad way forward.
> >
> > And I don't see much point anyway. Reading RSS counters from proc is
> > inherently racy. It can just either way after the read due to process
> > behaviour.
>
> Split RSS accounting doesn't make reading from mm counters racy. It
> makes these counters *wrong*. We flush task mm counters to the
> mm_struct once every 64 page faults that a task incurs or when that
> task exits. That means that if a thread takes 63 page faults and then
> sleeps for a week, that thread's process's mm counters are wrong by 63
> pages *for a week*. And some processes have a lot of threads,
> compounding the error. Split RSS accounting means that memory usage
> numbers don't add up. I don't think it's unreasonable to want a mode
> where memory counters to agree with other indicators of system
> activity.

It's documented behaviour that is upstream for 9 years. Why is your
workload special? The documentation suggests to use smaps if you want to
have precise data. Why would it not fly for you?

> Nobody has demonstrated that split RSS accounting actually helps in
> the real world.

The original commit 34e55232e59f ("mm: avoid false sharing of
mm_counter") shows numbers on cache misses. It's not a real world
workload, but you don't have any numbers at all to back your claim.

> But I've described above, concretely, how split RSS
> accounting hurts. I've been trying for over a year to either disable
> split RSS accounting or to let people opt out of it. If you won't
> remove split RSS accounting and you won't let me add a configuration
> knob that lets people opt out of it, what will you accept?

Keeping stats precise is welcome, but often expensive. It might be
negligible for small machine, but becomes a problem on multisocket
machine with dozens or hundreds of cores. We need to keep kernel
scalable.

We have other stats that update asynchronously (i.e. /proc/vmstat).
Would you like to convert them too?

-- 
 Kirill A. Shutemov
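
For readers following the thread, the batching behaviour Daniel describes (per-task deltas flushed to the mm once every 64 page faults) can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel code: the names mm_model, task_model, task_fault, task_sync and unflushed_pages are hypothetical, and the exact threshold comparison in the kernel may differ from the "every 64 faults" wording modelled here. Flushing at task exit is not modelled.

```c
/* Flush interval as described in the thread: one sync per 64 faults. */
#define FLUSH_EVENTS 64

/* Hypothetical stand-in for the per-process counter in mm_struct. */
struct mm_model {
	long rss;		/* value a reader of /proc would see */
};

/* Hypothetical stand-in for the per-thread cache in task_struct. */
struct task_model {
	struct mm_model *mm;
	long cached_rss;	/* delta not yet visible in mm->rss */
	int events;		/* faults since the last flush */
};

/* Fold the thread-local delta into the shared counter. */
static void task_sync(struct task_model *t)
{
	t->mm->rss += t->cached_rss;
	t->cached_rss = 0;
	t->events = 0;
}

/* Account one faulted-in page; flush only on every 64th event. */
static void task_fault(struct task_model *t)
{
	t->cached_rss++;
	if (++t->events >= FLUSH_EVENTS)
		task_sync(t);
}

/*
 * Let nthreads threads take faults_each faults and then go idle.
 * Returns how many pages are missing from the visible counter.
 */
static long unflushed_pages(int nthreads, int faults_each)
{
	struct mm_model mm = { 0 };
	long actual = 0;

	for (int i = 0; i < nthreads; i++) {
		struct task_model t = { .mm = &mm };

		for (int f = 0; f < faults_each; f++) {
			task_fault(&t);
			actual++;
		}
		/* The thread sleeps here; its cached delta stays unflushed. */
	}
	return actual - mm.rss;
}
```

In this model unflushed_pages(1, 63) gives the 63-page gap from the example above, and the gap grows linearly with the number of idle threads, which is the "compounding" effect being argued about. The per-fault cost that the batching avoids (a write to a shared counter on every fault) is the other side of the trade-off and is not measured by this sketch.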