On Tue, Apr 23, 2019 at 9:54 PM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
>
> Thank you very much Suren.
>
> On Tue, Apr 23, 2019 at 3:04 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > Hi Luigi,
> >
> > On Tue, Apr 23, 2019 at 11:58 AM Luigi Semenzato <semenzato@xxxxxxxxxx> wrote:
> > >
> > > I and others are working on improving system behavior under memory
> > > pressure on Chrome OS. We use zram, which swaps to a
> > > statically-configured compressed RAM disk. One challenge that we have
> > > is that the footprint of our workloads is highly variable. With zram,
> > > we have to set the size of the swap partition at boot time. When the
> > > (logical) swap partition is full, we're left with some amount of RAM
> > > usable by file and anonymous pages (we can ignore the rest). We don't
> > > get to control this amount dynamically. Thus if the workload fits
> > > nicely in it, everything works well. If it doesn't, then the rate of
> > > anonymous page faults can be quite high, causing large CPU overhead
> > > for compression/decompression (as well as for other parts of the MM).
> > >
> > > In Chrome OS and Android, we have the luxury that we can reduce
> > > pressure by terminating processes (tab discard in Chrome OS, app kill
> > > in Android---which incidentally also runs in parallel with Chrome OS
> > > on some chromebooks). To help decide when to reduce pressure, we
> > > would like to have a reliable and device-independent measure of MM CPU
> > > overhead. I have looked into PSI and have a few questions. I am also
> > > looking for alternative suggestions.
> > >
> > > PSI measures the times spent when some and all tasks are blocked by
> > > memory allocation. In some experiments, this doesn't seem to
> > > correlate too well with CPU overhead (which instead correlates fairly
> > > well with page fault rates). Could this be because it includes
> > > pressure from file page faults?
> >
> > This might be caused by thrashing (see:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1114).
> >
> > > Is there some way of interpreting PSI
> > > numbers so that the pressure from file pages is ignored?
> >
> > I don't think so, but I might be wrong. Notice that here
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/mm/filemap.c#L1111
> > you could probably use delayacct to distinguish file thrashing;
> > however, remember that PSI takes into account the number of CPUs and
> > the number of currently non-idle tasks in its pressure calculations,
> > so the raw delay numbers might not be very useful here.
>
> OK.
>
> > > What is the purpose of "some" and "full" in the PSI measurements? The
> > > chrome browser is a multi-process app and there is a lot of IPC. When
> > > process A is blocked on memory allocation, it cannot respond to IPC
> > > from process B, thus effectively both processes are blocked on
> > > allocation, but we don't see that.
> >
> > I don't think PSI would account for such an indirect stall when A is
> > waiting for B and B is blocked on memory access. B's stall will be
> > accounted for, but I don't think A's blocked time will go into PSI
> > calculations. The process inter-dependencies are probably out of scope
> > for PSI.
>
> Right, that's what I was also saying. It would be near impossible to
> figure it out. It may also be that statistically it doesn't matter,
> as long as the workload characteristics don't change dramatically.
> Which unfortunately they might...
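(For reference, since the some/full numbers come up repeatedly here: a
minimal, untested sketch of reading the aggregated averages that PSI
exports in /proc/pressure/memory. The two-line some/full format with
avg10/avg60/avg300 percentages and a cumulative total in microseconds
is the documented interface; the parsing and output below are only an
illustration, not a proposed tool.)

/*
 * Minimal sketch: read the aggregated memory pressure that PSI
 * exports via /proc/pressure/memory. The file has two lines,
 * "some ..." and "full ...", each with avg10/avg60/avg300
 * percentages and a cumulative stall time in microseconds.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char line[256];

	if (!f) {
		perror("/proc/pressure/memory");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char kind[8];
		double avg10;
		unsigned long long total;

		/* e.g. "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456" */
		if (sscanf(line, "%7s avg10=%lf %*s %*s total=%llu",
			   kind, &avg10, &total) == 3)
			printf("%s: avg10=%.2f%% total=%llu us\n",
			       kind, avg10, total);
	}
	fclose(f);
	return 0;
}

(If PSI is enabled for cgroup2, the per-cgroup memory.pressure files use
the same format.)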
>
> > > Also, there are situations in
> > > which some "uninteresting" processes keep running. So it's not clear we
> > > can rely on "full". Or maybe I am misunderstanding? "Some" may be a
> > > better measure, but again it doesn't measure indirect blockage.
> >
> > Johannes explains the SOME and FULL calculations here:
> > https://elixir.bootlin.com/linux/v5.1-rc6/source/kernel/sched/psi.c#L76
> > and includes a couple of examples, with the last one showing FULL>0
> > and some tasks still running.
>
> Thank you, yes, those are good explanations. I am still not sure how
> to use this in our case.
>
> I thought about using the page fault rate as a proxy for the
> allocation overhead. Unfortunately, it is difficult to figure out the
> baseline, because: 1. it is device-dependent (that's not
> insurmountable: we could compute a per-device baseline offline); 2.
> the CPUs can go in and out of turbo mode, or be temperature-throttled,
> and the notion of a constant "baseline" fails miserably.
>
> > > The kernel contains various cpustat measurements, including some
> > > slightly esoteric ones such as CPUTIME_GUEST and CPUTIME_GUEST_NICE.
> > > Would adding a CPUTIME_MEM be out of the question?
>
> Any opinion on CPUTIME_MEM?

I guess some description of how you plan to calculate it would be
helpful. A simple raw delay counter might not be very useful; that's
why PSI performs more elaborate calculations. Maybe posting a small
RFC patch with code would get more attention and let you collect more
feedback.

> Thanks again!
>
> > > Thanks!
> > >
> >
> > Just my 2 cents, and Johannes being the author might have more to say here.
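(As a footnote on the page-fault-rate idea above: a sketch that samples
the pgmajfault counter from /proc/vmstat over an interval. The counter
name is the standard one; the 10-second window and whatever baseline or
threshold gets applied to the delta are assumptions that would need
per-device tuning, which is exactly the problem described above with
frequency scaling and throttling.)

/*
 * Sketch of the "page fault rate as a proxy" idea: sample the
 * pgmajfault counter from /proc/vmstat twice and report the delta.
 * The counter name is real; the sampling interval and any
 * baseline/threshold applied to the result are only assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long long read_vmstat(const char *key)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val, ret = 0;

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, key)) {
			ret = val;
			break;
		}
	}
	fclose(f);
	return ret;
}

int main(void)
{
	unsigned long long before = read_vmstat("pgmajfault");

	sleep(10);	/* arbitrary sampling interval */
	printf("major faults in the last 10s: %llu\n",
	       read_vmstat("pgmajfault") - before);
	return 0;
}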