On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
>> On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>> > True Fix(tm) is a longer story.
>> >
>> > Maybe let's sort out this patchset first, whichever way. :)
>> >
>>
>> So I found the discussion around the original patch, with a perf
>> regression report:
>>
>> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
>>
>> The reporter suggests dodging the problem by only allocating per-cpu
>> counters when the process goes multithreaded. Given that there are
>> still plenty of forever-single-threaded procs out there, I think that
>> does sound like a great plan regardless of what happens with this
>> patchset.
>>
>> Almost all access is already done using dedicated routines, so this
>> should be an afternoon of churn to sort out, unless I missed a
>> showstopper. (Maybe there is no good place to stuff a flag or some
>> other indicator of the state of the counters?)
>>
>> That said, I'll look into it some time this or next week.
>
> Good, just let me know how it went. I also wanted to start looking into
> this to come up with some concrete patches :). What I had in mind was
> that we could use 'counters == NULL' as an indication that the counter
> is still in 'single counter mode'.

In the current state there are only pointers to counters in mm_struct
and there is no storage for them in task_struct. So I don't think
merely null-checking the per-cpu stuff is going to cut it -- where
should the single-threaded counters land?

Bonus problem: non-current can modify these counters, and this needs to
be safe against current playing with them at the same time. (And it
would be a shame to require current to use atomics on them.)

That said, my initial proposal adds a union:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e74ce4a28cd..ea70f0c08286 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -737,7 +737,11 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		struct percpu_counter rss_stat[NR_MM_COUNTERS];
+		union {
+			struct percpu_counter rss_stat[NR_MM_COUNTERS];
+			u64 *rss_stat_single;
+		};
+		bool magic_flag_stuffed_elsewhere;
 
 		struct linux_binfmt *binfmt;

Then for the single-threaded case an area is allocated for
NR_MM_COUNTERS counters * 2 -- the first set is updated without any
synchronization, by the current thread only. The second set is only
modified by others and is protected with mm->arg_lock. The lock
protects remote access to the union to begin with.

Transition to per-CPU operation sets the magic flag (there is plenty of
spare space in mm_struct, I'll find a good home for it without growing
the struct). It would be a one-way street -- a process which gets a
bunch of threads and then goes back to one stays with per-CPU.

Then you get the true value of a given counter by adding up both sets.

arg_lock is sparingly used, so remote ops are not expected to contend
with anything. In fact their cost is going to go down, since percpu
summation takes a spinlock which also disables interrupts. Local ops
should be about the same in cost as they are right now.

I might have missed some detail in the above description, but I think
the approach is decent.

-- 
Mateusz Guzik <mjguzik gmail.com>
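
P.S. To make the intended accessor behavior concrete, below is a rough
sketch of what the counter helpers could look like under this scheme.
It is illustrative only: the helper names and the slot layout (local
set in [0, NR_MM_COUNTERS), remote set in [NR_MM_COUNTERS,
2 * NR_MM_COUNTERS)) are assumptions of mine, not actual patch code,
and the memory-ordering details around the one-way flag flip are
glossed over here.

#include <linux/mm_types.h>
#include <linux/percpu_counter.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* Hypothetical helper, not actual patch code. */
static void mm_counter_add(struct mm_struct *mm, int member, long value)
{
	if (!mm->magic_flag_stuffed_elsewhere) {
		/* Single-threaded mode: plain u64 array, no per-CPU machinery. */
		if (mm == current->mm) {
			/* Only current touches the local set, so no locking. */
			mm->rss_stat_single[member] += value;
		} else {
			/* Remote updates go to the second set, under arg_lock. */
			spin_lock(&mm->arg_lock);
			mm->rss_stat_single[NR_MM_COUNTERS + member] += value;
			spin_unlock(&mm->arg_lock);
		}
		return;
	}
	/* Multithreaded mode: the existing per-CPU counter path. */
	percpu_counter_add(&mm->rss_stat[member], value);
}

/* Hypothetical helper, not actual patch code. */
static u64 mm_counter_read(struct mm_struct *mm, int member)
{
	if (!mm->magic_flag_stuffed_elsewhere) {
		/*
		 * The true value is the sum of the local and remote sets.
		 * A remote reader would take mm->arg_lock around this, per
		 * the scheme above; omitted here for brevity.
		 */
		return mm->rss_stat_single[member] +
		       mm->rss_stat_single[NR_MM_COUNTERS + member];
	}
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}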