On Tue, Apr 23, 2019 at 9:54 PM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
>
> Thank you very much Suren.
>
> On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > Hi Luigi,
> >
> > On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> > >
> > > I and others are working on improving system behavior under memory
> > > pressure on Chrome OS. We use zram, which swaps to a
> > > statically-configured compressed RAM disk. One challenge that we have
> > > is that the footprint of our workloads is highly variable. With zram,
> > > we have to set the size of the swap partition at boot time. When the
> > > (logical) swap partition is full, we're left with some amount of RAM
> > > usable by file and anonymous pages (we can ignore the rest). We don't
> > > get to control this amount dynamically. Thus if the workload fits
> > > nicely in it, everything works well. If it doesn't, then the rate of
> > > anonymous page faults can be quite high, causing large CPU overhead
> > > for compression/decompression (as well as for other parts of the MM).
> > >
> > > In Chrome OS and Android, we have the luxury that we can reduce
> > > pressure by terminating processes (tab discard in Chrome OS, app kill
> > > in Android---which incidentally also runs in parallel with Chrome OS
> > > on some chromebooks). To help decide when to reduce pressure, we
> > > would like to have a reliable and device-independent measure of MM CPU
> > > overhead. I have looked into PSI and have a few questions. I am also
> > > looking for alternative suggestions.
> > >
> > > PSI measures the times spent when some and all tasks are blocked by
> > > memory allocation. In some experiments, this doesn't seem to
> > > correlate too well with CPU overhead (which instead correlates fairly
> > > well with page fault rates). Could this be because it includes
> > > pressure from file page faults?
> >
> > This might be caused by thrashing (see:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
> >
> > > Is there some way of interpreting PSI
> > > numbers so that the pressure from file pages is ignored?
> >
> > I don't think so, but I might be wrong. Notice that here
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> > you could probably use delayacct to distinguish file thrashing;
> > however, remember that PSI takes into account the number of CPUs and
> > the number of currently non-idle tasks in its pressure calculations,
> > so the raw delay numbers might not be very useful here.
>
> OK.
>
> > > What is the purpose of "some" and "full" in the PSI measurements? The
> > > chrome browser is a multi-process app and there is a lot of IPC. When
> > > process A is blocked on memory allocation, it cannot respond to IPC
> > > from process B, thus effectively both processes are blocked on
> > > allocation, but we don't see that.
> >
> > I don't think PSI would account for such an indirect stall when A is
> > waiting for B and B is blocked on memory access. B's stall will be
> > accounted for, but I don't think A's blocked time will go into PSI
> > calculations. The process inter-dependencies are probably out of scope
> > for PSI.
>
> Right, that's what I was also saying. It would be near impossible to
> figure it out. It may also be that statistically it doesn't matter,
> as long as the workload characteristics don't change dramatically.
> Which unfortunately they might...
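(For reference, since the some/full numbers come up repeatedly here: a
minimal, untested sketch of reading the aggregated averages that PSI
exports in /proc/pressure/memory. The two-line some/full format with
avg10/avg60/avg300 percentages and a cumulative total in microseconds
is the documented interface; the parsing and output below are only an
illustration, not a proposed tool.)

/*
 * Minimal sketch: read the aggregated memory pressure that PSI
 * exports via /proc/pressure/memory. The file has two lines,
 * "some ..." and "full ...", each with avg10/avg60/avg300
 * percentages and a cumulative stall time in microseconds.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char line[256];

	if (!f) {
		perror("/proc/pressure/memory");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char kind[8];
		double avg10;
		unsigned long long total;

		/* e.g. "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456" */
		if (sscanf(line, "%7s avg10=%lf %*s %*s total=%llu",
			   kind, &avg10, &total) == 3)
			printf("%s: avg10=%.2f%% total=%llu us\n",
			       kind, avg10, total);
	}
	fclose(f);
	return 0;
}

(If PSI is enabled for cgroup2, the per-cgroup memory.pressure files use
the same format.)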
>
> > > Also, there are situations in
> > > which some "uninteresting" processes keep running. So it's not clear we
> > > can rely on "full". Or maybe I am misunderstanding? "Some" may be a
> > > better measure, but again it doesn't measure indirect blockage.
> >
> > Johannes explains the SOME and FULL calculations here:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> > and includes a couple of examples, with the last one showing FULL>0
> > and some tasks still running.
>
> Thank you, yes, those are good explanations. I am still not sure how
> to use this in our case.
>
> I thought about using the page fault rate as a proxy for the
> allocation overhead. Unfortunately, it is difficult to figure out the
> baseline, because: 1. it is device-dependent (that's not
> insurmountable: we could compute a per-device baseline offline); 2.
> the CPUs can go in and out of turbo mode, or be temperature-throttled,
> and the notion of a constant "baseline" fails miserably.
>
> > > The kernel contains various cpustat measurements, including some
> > > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > > Would adding a CPUTIME_MEM be out of the question?
>
> Any opinion on CPUTIME_MEM?

I guess some description of how you plan to calculate it would be
helpful. A simple raw delay counter might not be very useful; that's
why PSI performs more elaborate calculations. Maybe posting a small
RFC patch with code would get more attention and let you collect more
feedback.

> Thanks again!
>
> > > Thanks!
> > >
> >
> > Just my 2 cents, and Johannes being the author might have more to say here.
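(As a footnote on the page-fault-rate idea above: a sketch that samples
the pgmajfault counter from /proc/vmstat over an interval. The counter
name is the standard one; the 10-second window and whatever baseline or
threshold gets applied to the delta are assumptions that would need
per-device tuning, which is exactly the problem described above with
frequency scaling and throttling.)

/*
 * Sketch of the "page fault rate as a proxy" idea: sample the
 * pgmajfault counter from /proc/vmstat twice and report the delta.
 * The counter name is real; the sampling interval and any
 * baseline/threshold applied to the result are only assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long read_vmstat(const char *key)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val, ret = 0;

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, key)) {
			ret = val;
			break;
		}
	}
	fclose(f);
	return ret;
}

int main(void)
{
	unsigned long long before = read_vmstat("pgmajfault");

	sleep(10);	/* arbitrary sampling interval */
	printf("major faults in the last 10s: %llu\n",
	       read_vmstat("pgmajfault") - before);
	return 0;
}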