Thank you, I can try to do that. It's not trivial to get right though.
I have to find the right compromise. A horribly wrong patch won't be
taken seriously, but a completely correct one would be a bit too much
work, given the probability that it will get rejected.

Thanks also to Johannes for the clarification!

On Wed, Apr 24, 2019 at 7:49 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, Apr 23, 2019 at 9:54 PM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> >
> > Thank you very much Suren.
> >
> > On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > >
> > > Hi Luigi,
> > >
> > > On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> > > >
> > > > I and others are working on improving system behavior under memory
> > > > pressure on Chrome OS. We use zram, which swaps to a
> > > > statically-configured compressed RAM disk. One challenge that we have
> > > > is that the footprint of our workloads is highly variable. With zram,
> > > > we have to set the size of the swap partition at boot time. When the
> > > > (logical) swap partition is full, we're left with some amount of RAM
> > > > usable by file and anonymous pages (we can ignore the rest). We don't
> > > > get to control this amount dynamically. Thus if the workload fits
> > > > nicely in it, everything works well. If it doesn't, then the rate of
> > > > anonymous page faults can be quite high, causing large CPU overhead
> > > > for compression/decompression (as well as for other parts of the MM).
> > > >
> > > > In Chrome OS and Android, we have the luxury that we can reduce
> > > > pressure by terminating processes (tab discard in Chrome OS, app kill
> > > > in Android---which incidentally also runs in parallel with Chrome OS
> > > > on some chromebooks). To help decide when to reduce pressure, we
> > > > would like to have a reliable and device-independent measure of MM CPU
> > > > overhead. I have looked into PSI and have a few questions. I am also
> > > > looking for alternative suggestions.
> > > >
> > > > PSI measures the times spent when some and all tasks are blocked by
> > > > memory allocation. In some experiments, this doesn't seem to
> > > > correlate too well with CPU overhead (which instead correlates fairly
> > > > well with page fault rates). Could this be because it includes
> > > > pressure from file page faults?
> > >
> > > This might be caused by thrashing (see:
> > > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
> > >
> > > > Is there some way of interpreting PSI
> > > > numbers so that the pressure from file pages is ignored?
> > >
> > > I don't think so, but I might be wrong. Notice here
> > > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> > > you could probably use delayacct to distinguish file thrashing.
> > > However, remember that PSI takes into account the number of CPUs and
> > > the number of currently non-idle tasks in its pressure calculations,
> > > so the raw delay numbers might not be very useful here.
> >
> > OK.
> >
> > > > What is the purpose of "some" and "full" in the PSI measurements? The
> > > > chrome browser is a multi-process app and there is a lot of IPC. When
> > > > process A is blocked on memory allocation, it cannot respond to IPC
> > > > from process B, thus effectively both processes are blocked on
> > > > allocation, but we don't see that.
> > >
> > > I don't think PSI would account such an indirect stall when A is
> > > waiting for B and B is blocked on memory access. B's stall will be
> > > accounted for, but I don't think A's blocked time will go into PSI
> > > calculations. The process inter-dependencies are probably out of
> > > scope for PSI.
> >
> > Right, that's what I was also saying. It would be nearly impossible
> > to figure it out. It may also be that statistically it doesn't
> > matter, as long as the workload characteristics don't change
> > dramatically. Which unfortunately they might...
> >
> > > > Also, there are situations in
> > > > which some "uninteresting" processes keep running. So it's not clear
> > > > we can rely on "full". Or maybe I am misunderstanding? "Some" may be
> > > > a better measure, but again it doesn't measure indirect blockage.
> > >
> > > Johannes explains the SOME and FULL calculations here:
> > > https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> > > and includes a couple of examples, with the last one showing FULL>0
> > > and some tasks still running.
> >
> > Thank you, yes, those are good explanations. I am still not sure how
> > to use this in our case.
> >
> > I thought about using the page fault rate as a proxy for the
> > allocation overhead. Unfortunately it is difficult to figure out the
> > baseline, because: 1. it is device-dependent (that's not
> > insurmountable: we could compute a per-device baseline offline); 2.
> > the CPUs can go in and out of turbo mode, or temperature-throttling,
> > and the notion of a constant "baseline" fails miserably.
> >
> > > > The kernel contains various cpustat measurements, including some
> > > > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > > > Would adding a CPUTIME_MEM be out of the question?
> >
> > Any opinion on CPUTIME_MEM?
>
> I guess some description of how you plan to calculate it would be
> helpful. A simple raw delay counter might not be very useful; that's
> why PSI performs more elaborate calculations.
> Maybe posting a small RFC patch with code would get more attention and
> let you collect more feedback.
>
> > Thanks again!
> >
> > > > Thanks!
> > > >
> > > Just my 2 cents, and Johannes being the author might have more to
> > > say here.
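
For concreteness, here is a minimal userspace sketch of reading the
SOME/FULL averages discussed above from /proc/pressure/memory (format
as described in Documentation/accounting/psi.txt). It is only an
illustration; the 10.0 threshold is an arbitrary placeholder, not a
tuned value.

/*
 * Sketch: read the "some" and "full" avg10 values from
 * /proc/pressure/memory. The threshold below is a placeholder.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/pressure/memory", "r");
        char line[256];
        float some_avg10 = 0, full_avg10 = 0, avg10;

        if (!f) {
                perror("fopen /proc/pressure/memory");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "some avg10=%f", &avg10) == 1)
                        some_avg10 = avg10;
                else if (sscanf(line, "full avg10=%f", &avg10) == 1)
                        full_avg10 = avg10;
        }
        fclose(f);

        printf("some avg10=%.2f full avg10=%.2f\n", some_avg10, full_avg10);
        if (full_avg10 > 10.0)  /* placeholder threshold */
                printf("consider reducing pressure (e.g. discard a tab)\n");
        return 0;
}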
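
And for comparison, the page-fault-rate proxy I mentioned can be
sampled from /proc/vmstat along these lines (again just a sketch;
pgmajfault is the standard counter name, while the 10-second interval
and any comparison against a per-device baseline are illustrative
only):

/*
 * Sketch: sample pgmajfault from /proc/vmstat twice and print the
 * major-fault rate over the interval.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long read_vmstat(const char *field)
{
        FILE *f = fopen("/proc/vmstat", "r");
        char name[64];
        unsigned long long val, ret = 0;

        if (!f)
                return 0;
        while (fscanf(f, "%63s %llu", name, &val) == 2) {
                if (strcmp(name, field) == 0) {
                        ret = val;
                        break;
                }
        }
        fclose(f);
        return ret;
}

int main(void)
{
        unsigned long long before = read_vmstat("pgmajfault");

        sleep(10);      /* sampling interval: illustrative only */

        unsigned long long after = read_vmstat("pgmajfault");

        printf("major faults/sec over the last 10s: %.1f\n",
               (after - before) / 10.0);
        return 0;
}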