On Mon, 2022-11-07 at 09:10 +0100, Michal Hocko wrote: > On Fri 04-11-22 22:45:58, Leonardo Brás wrote: > > On Fri, 2022-11-04 at 09:41 +0100, Michal Hocko wrote: > > > On Thu 03-11-22 13:53:41, Leonardo Brás wrote: > > > > On Thu, 2022-11-03 at 16:31 +0100, Michal Hocko wrote: > > > > > On Thu 03-11-22 11:59:20, Leonardo Brás wrote: > > > [...] > > > > > > I understand there will be a locking cost being paid in the isolated CPUs when: > > > > > > a) The isolated CPU is requesting the stock drain, > > > > > > b) When the isolated CPUs do a syscall and end up using the protected structure > > > > > > the first time after a remote drain. > > > > > > > > > > And anytime the charging path (consume_stock resp. refill_stock) > > > > > contends with the remote draining which is out of control of the RT > > > > > task. It is true that the RT kernel will turn that spin lock into a > > > > > sleeping RT lock and that could help with potential priority inversions > > > > > but still quite costly thing I would expect. > > > > > > > > > > > Both (a) and (b) should happen during a syscall, and IIUC the a rt workload > > > > > > should not expect the syscalls to be have a predictable time, so it should be > > > > > > fine. > > > > > > > > > > Now I am not sure I understand. If you do not consider charging path to > > > > > be RT sensitive then why is this needed in the first place? What else > > > > > would be populating the pcp cache on the isolated cpu? IRQs? > > > > > > > > I am mostly trying to deal with drain_all_stock() calling schedule_work_on() at > > > > isolated_cpus. Since the scheduled drain_local_stock() will be competing for cpu > > > > time with the RT workload, we can have preemption of the RT workload, which is a > > > > problem for meeting the deadlines. > > > > > > Yes, this is understood. But it is not really clear to me why would any > > > draining be necessary for such an isolated CPU if no workload other than > > > the RT (which pressumably doesn't charge any memory?) is running on that > > > CPU? Is that the RT task during the initialization phase that leaves > > > that cache behind or something else? > > > > (I am new to this part of the code, so please correct me when I miss something.) > > > > IIUC, if a process belongs to a control group with memory control, the 'charge' > > will happen when a memory page starts getting used by it. > > Yes, very broadly speaking. > > > So, if we assume a RT load in a isolated CPU will not charge any memory, we are > > assuming it will never be part of a memory-controlled cgroup. > > If the memory cgroup controler is enabled then each user space process > is a part of some memcg. If there is no specific memcg assigned then it > will be a root cgroup and that is skipped during most charges except for > kmem. Oh, it makes sense. Thanks for helping me understand that! > > > I mean, can we just assume this? > > > > If I got that right, would not that be considered a limitation? like > > "If you don't want your workload to be interrupted by perCPU cache draining, > > don't put it in a cgroup with memory control". > > We definitely do not want userspace make any assumptions on internal > implementation details like caches. Perfect, that was my expectation. > > > > Sorry for being so focused on this > > > but I would like to understand on whether this is avoidable by a > > > different startup scheme or it really needs to be addressed in some way. > > > > No worries, I am in fact happy you are giving it this much attention :) > > > > I also understand this is a considerable change in the locking strategy, and > > avoiding that is the first thing that should be tried. > > > > > > > > > One way I thought to solve that was introducing a remote drain, which would > > > > require a different strategy for locking, since not all accesses to the pcp > > > > caches would happen on a local CPU. > > > > > > Yeah, I am not supper happy about additional spin lock TBH. One > > > potential way to go would be to completely avoid pcp cache for isolated > > > CPUs. That would have some performance impact of course but on the other > > > hand it would give a more predictable behavior for those CPUs which > > > sounds like a reasonable compromise to me. What do you think? > > > > You mean not having a perCPU stock, then? > > So consume_stock() for isolated CPUs would always return false, causing > > try_charge_memcg() always walking the slow path? > > Exactly. > > > IIUC, both my proposal and yours would degrade performance only when we use > > isolated CPUs + memcg. Is that correct? > > Yes, with a notable difference that with your spin lock option there is > still a chance that the remote draining could influence the isolated CPU > workload throug that said spinlock. If there is no pcp cache for that > cpu being used then there is no potential interaction at all. I see. But the slow path is slow for some reason, right? Does not it make use of any locks also? So on normal operation there could be a potentially larger impact than a spinlock, even though there would be no scheduled draining. > > > If so, it looks like the impact would be even bigger without perCPU stock , > > compared to introducing a spinlock. > > > > Unless, we are counting to this case where a remote CPU is draining an isolated > > CPU, and the isolated CPU faults a page, and has to wait for the spinlock to be > > released in the remote CPU. Well, this seems possible to happen, but I would > > have to analyze how often would it happen, and how much would it impact the > > deadlines. I *guess* most of the RT workload's memory pages are pre-faulted > > before its starts, so it can avoid the faulting latency, but I need to confirm > > that. > > Yes, that is a general practice and the reason why I was asking how real > of a problem that is in practice. I remember this was one common factor on deadlines being missed in the workload analyzed. Need to redo the test to be sure. > It is true true that appart from user > space memory which can be under full control of the userspace there are > kernel allocations which can be done on behalf of the process and those > could be charged to memcg as well. So I can imagine the pcp cache could > be populated even if the process is not faulting anything in during RT > sensitive phase. Humm, I think I will apply the change and do a comparative testing with upstream. This should bring good comparison results. > > > On the other hand, compared to how it works now now, this should be a more > > controllable way of introducing latency than a scheduled cache drain. > > > > Your suggestion on no-stocks/caches in isolated CPUs would be great for > > predictability, but I am almost sure the cost in overall performance would not > > be fine. > > It is hard to estimate the overhead without measuring that. Do you think > you can give it a try? If the performance is not really acceptable > (which I would be really surprised) then we can think of a more complex > solution. Sure, I can try that. Do you suggest any specific workload that happens to stress the percpu cache usage, with usual drains and so? Maybe I will also try with synthetic worloads also. > > > With the possibility of prefaulting pages, do you see any scenario that would > > introduce some undesirable latency in the workload? > > My primary concern would be spin lock contention which is hard to > predict with something like remote draining. It makes sense. I will do some testing and come out with results for that. Thanks for reviewing! Leo