On Wed, 2022-11-09 at 09:05 +0100, Michal Hocko wrote:
> On Tue 08-11-22 20:09:25, Leonardo Brás wrote:
> [...]
> > > Yes, with a notable difference that with your spin lock option there is
> > > still a chance that the remote draining could influence the isolated CPU
> > > workload through that said spinlock. If there is no pcp cache for that
> > > cpu being used then there is no potential interaction at all.
> >
> > I see.
> > But the slow path is slow for some reason, right?
> > Does it not make use of any locks as well? So on normal operation there could be a
> > potentially larger impact than a spinlock, even though there would be no
> > scheduled draining.
>
> Well, for the regular (try_charge) path that is essentially page_counter_try_charge,
> which boils down to atomic_long_add_return of the memcg counter + all
> parents up the hierarchy and high memory limit evaluation (essentially 2
> atomic_reads for the memcg + all parents up the hierarchy). That is not
> a whole lot - especially when the memcg hierarchy is not very deep.
>
> Per cpu batch amortizes those per hierarchy updates as well as atomic
> operations + cache lines bouncing on updates.
>
> On the other hand spinlock would do the unconditional atomic updates as
> well and even much more on CONFIG_RT. A plus is that the update will be
> mostly local so cache line bouncing shouldn't be terrible. Unless
> somebody heavily triggers pcp cache draining, but this shouldn't be all
> that common (e.g. when a memcg triggers its limit).
>
> All that being said, I am still not convinced that the pcp cache bypass
> for isolated CPUs would make a dramatic difference. Especially in the
> context of workloads that tend to run on isolated CPUs and rarely enter
> the kernel.
>
> > > It is true that apart from user
> > > space memory which can be under full control of the userspace there are
> > > kernel allocations which can be done on behalf of the process and those
> > > could be charged to memcg as well. So I can imagine the pcp cache could
> > > be populated even if the process is not faulting anything in during the RT
> > > sensitive phase.
> >
> > Humm, I think I will apply the change and do comparative testing with
> > upstream. This should bring good comparison results.
>
> That would certainly be appreciated!
>
> > > > On the other hand, compared to how it works now, this should be a more
> > > > controllable way of introducing latency than a scheduled cache drain.
> > > >
> > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for
> > > > predictability, but I am almost sure the cost in overall performance would not
> > > > be fine.
> > >
> > > It is hard to estimate the overhead without measuring that. Do you think
> > > you can give it a try? If the performance is not really acceptable
> > > (which would really surprise me) then we can think of a more complex
> > > solution.
> >
> > Sure, I can try that.
> > Do you suggest any specific workload that happens to stress the percpu cache
> > usage, with usual drains and so on? Maybe I will also try with synthetic
> > workloads.
>
> I really think you want to test it on an isolcpu-aware workload.
> Artificial benchmarks are not all that useful in this context.

Hello Michal,

I just sent a v2 for this patchset with a lot of changes:
https://lore.kernel.org/lkml/20230125073502.743446-1-leobras@xxxxxxxxxx/

I have tried to gather some data on the performance numbers as suggested,
but I got carried away and the cover letter ended up too big.
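
For anyone skimming the thread, below is a rough user-space sketch of the two
charge paths we have been comparing: the per-CPU stock fast path versus the
hierarchical page_counter walk. It is only a toy model of the idea, not the
actual mm/memcontrol.c code; everything ending in _model (and the CHARGE_BATCH
constant) is invented here for illustration.

/*
 * Toy user-space model of the two charge paths discussed above.
 * This is NOT the code from mm/memcontrol.c; the *_model names are
 * invented for illustration.  It only shows why the per-CPU stock
 * amortizes the hierarchical atomic updates that the slow path
 * (the page_counter_try_charge-like walk) pays on every call.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CHARGE_BATCH 64		/* stand-in for the real per-CPU batch size */

struct page_counter_model {
	atomic_long usage;
	long limit;
	struct page_counter_model *parent;
};

/* Slow path: one atomic RMW per level of the hierarchy, every time. */
static bool try_charge_slow(struct page_counter_model *c, long nr_pages)
{
	struct page_counter_model *level, *undo;

	for (level = c; level; level = level->parent) {
		long new = atomic_fetch_add(&level->usage, nr_pages) + nr_pages;

		if (new > level->limit) {
			/* Roll back this level and every level charged so far. */
			atomic_fetch_sub(&level->usage, nr_pages);
			for (undo = c; undo != level; undo = undo->parent)
				atomic_fetch_sub(&undo->usage, nr_pages);
			return false;
		}
	}
	return true;
}

/* "Per-CPU" stock, modeled here as per-thread for simplicity. */
static _Thread_local struct {
	struct page_counter_model *cached;
	long nr_pages;
} stock;

/* Fast path: a plain local subtraction, no atomics, no hierarchy walk. */
static bool consume_stock_model(struct page_counter_model *c, long nr_pages)
{
	if (stock.cached == c && stock.nr_pages >= nr_pages) {
		stock.nr_pages -= nr_pages;
		return true;
	}
	return false;
}

static bool charge_model(struct page_counter_model *c, long nr_pages)
{
	if (consume_stock_model(c, nr_pages))
		return true;

	/* Pay the hierarchical atomic cost once per CHARGE_BATCH pages. */
	if (try_charge_slow(c, nr_pages + CHARGE_BATCH)) {
		stock.cached = c;
		stock.nr_pages = CHARGE_BATCH;
		return true;
	}
	/* Close to the limit: fall back to charging just what was asked. */
	return try_charge_slow(c, nr_pages);
}

int main(void)
{
	struct page_counter_model root  = { .usage = 0, .limit = 1 << 20 };
	struct page_counter_model child = { .usage = 0, .limit = 1 << 16,
					    .parent = &root };
	long i, charged = 0;

	for (i = 0; i < 1000; i++)
		charged += charge_model(&child, 1);

	printf("charged %ld pages, child usage %ld, root usage %ld\n",
	       charged, atomic_load(&child.usage), atomic_load(&root.usage));
	return 0;
}

In the real kernel the stock is of course per-CPU rather than per-thread, a
stock cached for a different memcg gets drained before refilling, and it is
exactly that draining on isolated CPUs that this series is trying to avoid;
the sketch is only meant to show where the atomics are paid on each path.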
I hope it's not too much trouble.

Best regards,
Leo