On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > On 8/22/23, Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > >> On 8/21/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >> > True Fix(tm) is a longer story.
> > >> >
> > >> > Maybe let's sort out this patchset first, whichever way. :)
> > >> >
> > >>
> > >> So I found the discussion around the original patch with a perf
> > >> regression report.
> > >>
> > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > >>
> > >> The reporter suggests dodging the problem by only allocating per-cpu
> > >> counters when the process is going multithreaded. Given that there
> > >> are still plenty of forever single-threaded procs out there, I think
> > >> that does sound like a great plan regardless of what happens with
> > >> this patchset.
> > >>
> > >> Almost all access is already done using dedicated routines, so this
> > >> should be an afternoon churn to sort out, unless I missed a
> > >> showstopper. (Maybe there is no good place to stuff a flag/whatever
> > >> other indicator about the state of the counters?)
> > >>
> > >> That said, I'll look into it some time this or next week.
> > >
> > > Good, just let me know how it went; I also wanted to start looking
> > > into this to come up with some concrete patches :). What I had in
> > > mind was that we could use 'counters == NULL' as an indication that
> > > the counter is still in 'single counter mode'.
> >
> > In the current state there are only pointers to counters in mm_struct
> > and there is no storage for them in task_struct. So I don't think
> > merely null-checking the per-cpu stuff is going to cut it -- where
> > should the single-threaded counters land?
>
> I think you misunderstood.
> What I wanted to do is to provide a new flavor of percpu_counter
> (sharing most of the code and definitions) which would have an option
> to start as a simple counter (indicated by pcc->counters == NULL and
> using pcc->count for counting) and then be upgraded by a call to the
> real percpu thing. Because I think such counters would be useful also
> on other occasions than as rss counters.

Kent wrote something similar and sent it out last year [1]. However, the
case slightly differs from what we'd want here: 1 -> 2 threads becomes
percpu, vs. an update rate which a single thread might be able to
trigger?

[1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@xxxxxxxxxx/

Thanks,
Dennis

> > Bonus problem: non-current can modify these counters and this needs
> > to be safe against current playing with them at the same time. (And
> > it would be a shame to require current to use atomics on them.)
>
> Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can
> be modifying the counters for other processes. Thanks for pointing this
> out.
>
> > That said, my initial proposal adds a union:
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5e74ce4a28cd..ea70f0c08286 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -737,7 +737,11 @@ struct mm_struct {
> >
> >  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> >
> > -	struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +	union {
> > +		struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > +		u64 *rss_stat_single;
> > +	};
> > +	bool magic_flag_stuffed_elsewhere;
> >
> >  	struct linux_binfmt *binfmt;
> >
> > Then for the single-threaded case an area is allocated for
> > NR_MM_COUNTERS counters * 2 -- the first set is updated without any
> > synchro by the current thread. The second set is only to be modified
> > by others and protected with mm->arg_lock. The lock protects remote
> > access to the union to begin with.
>
> arg_lock seems a bit like a hack.
> How is it related to rss_stat? The scheme with two counters is clever,
> but I'm not 100% convinced the complexity is really worth it. I'm not
> sure the overhead of always using an atomic counter would really be
> measurable, as atomic ops on a counter in the local CPU cache tend to
> be cheap. Did you try to measure the difference?
>
> If the second counter proves to be worth it, we could make just that
> one atomic to avoid the need for abusing some spinlock.
>
> > Transition to per-CPU operation sets the magic flag (there is plenty
> > of spare space in mm_struct, I'll find a good home for it without
> > growing the struct). It would be a one-way street -- a process which
> > gets a bunch of threads and goes back to one stays with per-CPU.
>
> Agreed with the switch being a one-way street.
>
> > Then you get the true value of something by adding both counters.
> >
> > arg_lock is sparingly used, so remote ops are not expected to contend
> > with anything. In fact their cost is going to go down, since percpu
> > summation takes a spinlock which also disables interrupts.
> >
> > Local ops should be about the same in cost as they are right now.
> >
> > I might have missed some detail in the above description, but I think
> > the approach is decent.
>
> 								Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR