Re: PSI vs. CPU overhead for client computing

Thank you very much, Suren.

On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> Hi Luigi,
>
> On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> >
> > Others and I are working on improving system behavior under memory
> > pressure on Chrome OS.  We use zram, which swaps to a statically
> > configured compressed RAM disk.  One challenge that we have
> > is that the footprint of our workloads is highly variable.  With zram,
> > we have to set the size of the swap partition at boot time.  When the
> > (logical) swap partition is full, we're left with some amount of RAM
> > usable by file and anonymous pages (we can ignore the rest).  We don't
> > get to control this amount dynamically.  Thus if the workload fits
> > nicely in it, everything works well.  If it doesn't, then the rate of
> > anonymous page faults can be quite high, causing large CPU overhead
> > for compression/decompression (as well as for other parts of the MM).
> >
> > In Chrome OS and Android, we have the luxury that we can reduce
> > pressure by terminating processes (tab discard in Chrome OS, app kill
> > in Android---which incidentally also runs in parallel with Chrome OS
> > on some chromebooks).  To help decide when to reduce pressure, we
> > would like to have a reliable and device-independent measure of MM CPU
> > overhead.  I have looked into PSI and have a few questions.  I am also
> > looking for alternative suggestions.
> >
> > PSI measures the times spent when some and all tasks are blocked by
> > memory allocation.  In some experiments, this doesn't seem to
> > correlate too well with CPU overhead (which instead correlates fairly
> > well with page fault rates).  Could this be because it includes
> > pressure from file page faults?
>
> This might be caused by thrashing (see:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
>
> >  Is there some way of interpreting PSI
> > numbers so that the pressure from file pages is ignored?
>
> I don't think so, but I might be wrong. Notice here
> https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> that you could probably use delayacct to distinguish file thrashing;
> however, remember that PSI takes into account the number of CPUs and
> the number of currently non-idle tasks in its pressure calculations,
> so the raw delay numbers might not be very useful here.

OK.

> > What is the purpose of "some" and "full" in the PSI measurements?  The
> > Chrome browser is a multi-process app and there is a lot of IPC.  When
> > process A is blocked on memory allocation, it cannot respond to IPC
> > from process B; thus effectively both processes are blocked on
> > allocation, but we don't see that.
>
> I don't think PSI would account for such an indirect stall when A is
> waiting for B and B is blocked on memory access. B's stall will be
> accounted for, but I don't think A's blocked time will go into PSI
> calculations. The process inter-dependencies are probably out of scope
> for PSI.

Right, that's what I was also saying.  It would be nearly impossible
to figure it out.  It may also be that statistically it doesn't
matter, as long as the workload characteristics don't change
dramatically, which unfortunately they might...

> > Also, there are situations in
> > which some "uninteresting" processes keep running.  So it's not clear we
> > can rely on "full".  Or maybe I am misunderstanding?  "Some" may be a
> > better measure, but again it doesn't measure indirect blockage.
>
> Johannes explains the SOME and FULL calculations here:
> https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> and includes a couple of examples, the last one showing FULL > 0 while
> some tasks are still running.

Thank you, yes, those are good explanations.  I am still not sure how
to use this in our case.
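
For concreteness, if we do go the PSI route, what we would be
consuming is the aggregated output of /proc/pressure/memory.  The
untested sketch below (error handling omitted) is roughly how I
imagine polling the "some" and "full" avg10 numbers from user space:

/*
 * Sketch only: read the "some" and "full" avg10 values from
 * /proc/pressure/memory, which contains lines of the form
 *   some avg10=0.07 avg60=1.06 avg300=0.12 total=4533
 *   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	double some_avg10 = 0.0, full_avg10 = 0.0;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "some ", 5))
			sscanf(line, "some avg10=%lf", &some_avg10);
		else if (!strncmp(line, "full ", 5))
			sscanf(line, "full avg10=%lf", &full_avg10);
	}
	fclose(f);
	printf("memory pressure: some=%.2f%% full=%.2f%%\n",
	       some_avg10, full_avg10);
	return 0;
}

That part is easy; the open questions above remain, namely that these
averages mix anonymous and file pressure and don't see the indirect
IPC stalls.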

I thought about using the page fault rate as a proxy for the
allocation overhead.  Unfortunately it is difficult to establish a
baseline, because: 1. it is device-dependent (that's not
insurmountable: we could compute a per-device baseline offline); 2.
the CPUs can go in and out of turbo mode or thermal throttling, so
the notion of a constant "baseline" fails miserably.
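
To make that proxy concrete, below is roughly the kind of sampling I
have in mind (untested sketch; whether pgmajfault alone is the right
counter for the zram decompression cost is itself an assumption):

/*
 * Sketch only: sample pgmajfault from /proc/vmstat twice and report
 * major faults per second.  The hard part is not this, but knowing
 * what rate counts as "too high" on a given device.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long vmstat_read(const char *key)
{
	char name[64];
	unsigned long long val = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, key))
			break;
		val = 0;
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long before = vmstat_read("pgmajfault");

	sleep(10);
	printf("pgmajfault/sec: %llu\n",
	       (vmstat_read("pgmajfault") - before) / 10);
	return 0;
}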

> > The kernel contains various cpustat measurements, including some
> > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > Would adding a CPUTIME_MEM be out of the question?

Any opinion on CPUTIME_MEM?
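
To be clear about what I mean, and purely as a strawman (not a tested
patch): the idea is one more slot in enum cpu_usage_stat in
include/linux/kernel_stat.h, charged from the fault/reclaim/compression
paths and exported alongside the existing fields:

enum cpu_usage_stat {
	CPUTIME_USER,
	CPUTIME_NICE,
	CPUTIME_SYSTEM,
	CPUTIME_SOFTIRQ,
	CPUTIME_IRQ,
	CPUTIME_IDLE,
	CPUTIME_IOWAIT,
	CPUTIME_STEAL,
	CPUTIME_GUEST,
	CPUTIME_GUEST_NICE,
	CPUTIME_MEM,		/* hypothetical: CPU time spent on MM work */
	NR_STATS,
};

The hard part, of course, is deciding exactly where such time would be
charged, which is why I am asking.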

Thanks again!

> > Thanks!
> >
>
> Just my 2 cents; Johannes, being the author, might have more to say here.



