On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote: > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote: > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote: > > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai <cai@xxxxxx> wrote: > > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione <dancol@xxxxxxxxxx> wrote: > > > > > > > > > > Adding the correct linux-mm address. > > > > > > > > > > > > > > >> +config SPLIT_RSS_COUNTING > > > > >> + bool "Per-thread mm counter caching" > > > > >> + depends on MMU > > > > >> + default y if NR_CPUS >= SPLIT_PTLOCK_CPUS > > > > >> + help > > > > >> + Cache mm counter updates in thread structures and > > > > >> + flush them to visible per-process statistics in batches. > > > > >> + Say Y here to slightly reduce cache contention in processes > > > > >> + with many threads at the expense of decreasing the accuracy > > > > >> + of memory statistics in /proc. > > > > >> + > > > > >> endmenu > > > > > > > > All those vague words are going to make developers almost impossible to decide the right selection here. It sounds like we should kill SPLIT_RSS_COUNTING at all to simplify the code as the benefit is so small vs the side-effect? > > > > > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile > > > and a basic desktop, it doesn't make a difference. I figured making it > > > a knob would help allay concerns about the performance impact in more > > > extreme configurations. > > > > I do agree with Qian. Either it is really helpful (is it? probably on > > the number of cpus) and it should be auto-enabled or it should be > > dropped altogether. You cannot really expect people know how to enable > > this without a deep understanding of the MM internals. Not to mention > > all those users using distro kernels/configs. > > > > A config option sounds like a bad way forward. > > And I don't see much point anyway. Reading RSS counters from proc is > inherently racy. It can just either way after the read due to process > behaviour. Split RSS accounting doesn't make reading from mm counters racy. It makes these counters *wrong*. We flush task mm counters to the mm_struct once every 64 page faults that a task incurs or when that task exits. That means that if a thread takes 63 page faults and then sleeps for a week, that thread's process's mm counters are wrong by 63 pages *for a week*. And some processes have a lot of threads, compounding the error. Split RSS accounting means that memory usage numbers don't add up. I don't think it's unreasonable to want a mode where memory counters to agree with other indicators of system activity. Nobody has demonstrated that split RSS accounting actually helps in the real world. But I've described above, concretely, how split RSS accounting hurts. I've been trying for over a year to either disable split RSS accounting or to let people opt out of it. If you won't remove split RSS accounting and you won't let me add a configuration knob that lets people opt out of it, what will you accept?