Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

Sudarshan Rajagopalan <quic_sudaraja@xxxxxxxxxxx> · Tue, 1 Aug 2023 14:20:00 -0700

On 1/23/2023 3:47 PM, Sudarshan Rajagopalan wrote:

On 1/23/2023 1:26 PM, T.J. Alumbaugh wrote:
Hi Sudarshan,

I had questions about the setup and another about the use of PSI.
Thanks for your comments, Alumbaugh.
1. This will be a native userspace daemon that runs only in the Linux
VM, which uses the virtio-mem driver and memory hotplug to add/remove
memory. The VM (aka Secondary VM, SVM) will request memory from the
host, which is the Primary VM (PVM), via the backend hypervisor, which
takes care of cross-VM communication.

In regards to the "PVM/SVM" nomenclature, is the implied setup one of
fault tolerance (i.e. the secondary is there to take over in case of
failure of the primary VM)? Generally speaking, are the PVM and SVM
part of a defined system running some workload? The context seems to
be that the situation is more intricate than "two virtual machines
running on a host", but I'm not clear how it is different from that
general notion.

Here the Primary VM (PVM) is actually the host, and we launch a VM from
this host. We simply call this newly launched VM the Secondary VM
(SVM). Sorry for the confusion here. The secondary VM runs in a secure
environment.

5. Detecting a decrease in memory pressure, the reverse part where we
give memory back to the PVM once it is no longer needed, is a bit
tricky. We look for pressure decay and see if the PSI averages (avg10,
avg60, avg300) go down, and along with other memory stats (such as
free memory etc.) we make an educated guess that the usecase has ended
and the memory it used has been freed, so this memory can be given
back to the PVM.

This is also very interesting to me. Detecting a decrease in pressure
using PSI seems difficult. IIUC, the approach taken in OOMD/senpai
from Meta seems to be continually applying pressure/backing off, and
then observing the outcome of that decision on the pressure metric to
feed back into the next decision (see links below). Is your approach
similar? Do you check the metric periodically or only when receiving
PSI memory events in userspace?

https://github.com/facebookincubator/senpai/blob/main/senpai.py#L117-L148 

https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L529-L538 

We have implemented logic that uses the PSI averages to check for the
rate of decay. If there are no new pressure events, these averages
decay exponentially, and we wait until the {avg10, avg60, avg300}
values fall below a certain threshold. The logic is as follows -

usecase ends
-> wait until no new pressure event occurs (this usually happens when
   all usecases end)
-> once there are no new pressure events, run the pressure decay check,
   which simply checks that the exponentially decaying averages have
   gone below a certain threshold
-> once this happens, we make an educated decision that the usecase has
   actually ended
-> check memory stats such as MemFree (here we actually take a memory
   snapshot when pressure builds up and new memory gets plugged in, and
   compare it with a memory snapshot taken when the pressure decay
   ends; that way we know how much memory was plugged in before, and we
   check if MemFree is in that range so that we know the previously
   added memory is no longer needed)
-> release the remaining free memory back to the Primary VM (host).
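
For illustration only, the last two steps of the flow above could be sketched as follows in Python. This is not the daemon's code: the function names, the `reserve_kib` headroom, and the choice to cap the release at the amount that was previously plugged in are all assumptions made for the example. Only the use of MemFree from /proc/meminfo comes from the description above.

```python
def parse_memfree_kib(meminfo_text):
    """Return the MemFree value (reported in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemFree:"):
            return int(line.split()[1])
    raise ValueError("MemFree not found")


def releasable_kib(plugged_kib, memfree_now_kib, reserve_kib=0):
    """Estimate how much previously plugged-in memory can be given back.

    plugged_kib     -- memory added while pressure was building up
                       (hypothetical value taken from the earlier snapshot)
    memfree_now_kib -- MemFree observed once the pressure decay has ended
    reserve_kib     -- headroom to keep in the VM (assumed tunable)
    """
    spare = max(0, memfree_now_kib - reserve_kib)
    # Never hand back more than what was plugged in for the usecase.
    return min(plugged_kib, spare)
```

With 512 MiB plugged in and a 100000 kB reserve, a MemFree of 300000 kB would yield 200000 kB to release, while a MemFree of 50000 kB would yield nothing.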

The reason why we check for exponential decay of the averages is that
it gives a clear picture that memory pressure is indeed going down,
and any new sudden spike in pressure will be factored into these
averages as an increase that can be observed. This is in contrast to
sampling the pressure on every tick, where you might miss sudden
spikes if the sampling interval is too wide.
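
As a rough sketch of how these averages can be read and compared against thresholds, the Python below parses text in the format of /proc/pressure/memory (lines starting with "some" or "full", followed by avg10=, avg60=, avg300= and total= fields). The function names and the threshold values in the usage note are assumptions for illustration, not the daemon's actual settings.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # "some" or "full"
        values = {}
        for item in fields[1:]:
            key, _, value = item.partition("=")
            # "total" is a stall time counter; the avg* fields are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def decayed_below(psi, thresholds, kind="some"):
    """True if every listed average is below its (assumed) threshold."""
    return all(psi[kind][name] < limit for name, limit in thresholds.items())
```

For example, `decayed_below(parse_psi(open("/proc/pressure/memory").read()), {"avg10": 1.0, "avg60": 2.0, "avg300": 5.0})` would report whether all three "some" averages have dropped below those example limits.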

Another cool thing about using averages is that you can calculate how
long it will take for pressure to decay from {avg10XX, avg60XX} to
{avg10TT, avg60TT}, where avg10TT,... are the set threshold values. So
you can sleep until this time and then wake up and check if the
averages have reached the threshold values. If they have not, that
means a new pressure event must have come in and interrupted the
decay. This way we don't have to sample the pressure every tick (saves
CPU cycles).
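
The sleep-time calculation described above can be sketched in Python as follows. It assumes that, with no new stalls, an average decays as avg(t) = avg(0) * exp(-t / window), with window equal to 10, 60 or 300 seconds; solving for the threshold gives t = window * ln(avg_now / threshold). As far as I know the kernel updates these averages in 2-second steps using fixed-point arithmetic, so the result should be treated as an estimate. The function names are hypothetical.

```python
import math

# Averaging windows, in seconds, for the three PSI averages.
PSI_WINDOWS_S = {"avg10": 10.0, "avg60": 60.0, "avg300": 300.0}


def seconds_until_below(avg_now, threshold, window_s):
    """Estimated idle time for one average to decay down to its threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if avg_now <= threshold:
        return 0.0
    return window_s * math.log(avg_now / threshold)


def sleep_time_s(averages, thresholds):
    """Time until all averages are expected to reach their thresholds."""
    return max(
        seconds_until_below(averages[name], thresholds[name], PSI_WINDOWS_S[name])
        for name in thresholds
    )
```

After sleeping for `sleep_time_s(...)`, the averages would be read again; if any is still above its threshold, new pressure arrived in the meantime and the wait is repeated.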

Very interesting proposal. Thanks for sending,

-T.J.

Resurrecting this thread to mention that we have posted the source code
of the userspace dynamic VM memory resizing daemon that was discussed
here upstream as an RFC. The patches are sent as is, and we will merge
them into GitHub or CodeLinaro after gathering all review comments.

https://lore.kernel.org/linux-arm-kernel/cover.1690836010.git.quic_sudaraja@xxxxxxxxxxx/T/#t

Alumbaugh, David - I would be glad to get your thoughts and comments on 
this since this topic would be of interest to you both.