Re: On guest free page hinting and OOM

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Mon, 1 Apr 2019 09:24:18 -0400

On Mon, Apr 01, 2019 at 10:17:51AM +0200, David Hildenbrand wrote:
> On 29.03.19 17:51, Michael S. Tsirkin wrote:
> > On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
> >> On 29.03.19 16:37, David Hildenbrand wrote:
> >>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
> >>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
> >>>>>
> >>>>> We had a very simple idea in mind: As long as a hinting request is
> >>>>> pending, don't actually trigger any OOM activity, but wait for it to be
> >>>>> processed. Can be done using simple atomic variable.
> >>>>>
> >>>>> This is a scenario that will only pop up when already pretty low on
> >>>>> memory. And the main difference to ballooning is that we *know* we will
> >>>>> get more memory soon.
> >>>>
> >>>> No we don't.  If we keep polling we are quite possibly keeping the CPU
> >>>> busy, so delaying the hint request processing.  Again, the issue is it's a
> >>>
> >>> You can always yield. But that's a different topic.
> >>>
> >>>> tradeoff. One kind of performance for the other. Very hard to know in
> >>>> advance which path you will hit, and in the real world no one has the
> >>>> time to profile and tune things. By comparison trading memory for
> >>>> performance is well understood.
> >>>>
> >>>>
> >>>>> "appended to guest memory", "global list of memory", malicious guests
> >>>>> always using that memory like what about NUMA?
> >>>>
> >>>> This can be up to the guest. A good approach would be to take
> >>>> a chunk out of each node and add to the hints buffer.
> >>>
> >>> This might lead to you not using the buffer efficiently. But also,
> >>> different topic.
> >>>
> >>>>
> >>>>> What about different page
> >>>>> granularity?
> >>>>
> >>>> Seems like an orthogonal issue to me.
> >>>
> >>> It is similar, yes. But if you support multiple granularities (e.g.
> >>> MAX_ORDER - 1, MAX_ORDER - 2 ...) you might have to implement some sort
> >>> of buddy for the buffer. This is different than just a list for each node.
> > 
> > Right but we don't plan to do it yet.
> 
> MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits) and
> early performance numbers indicate it might be the right thing to do. So
> it could be very desirable once we do more performance tests.
> 
> > 
> >> Oh, and before I forget, different zones might of course also be a problem.
> > 
> > I would just split the hint buffer evenly between zones.
> > 
> 
> Thinking about your approach, there is one elementary thing to notice:
> 
> Giving the guest pages from the buffer while hinting requests are being
> processed means that the guest can and will temporarily make use of more
> memory than desired. Essentially up to the point where MADV_FREE is
> finally called for the hinted pages.

Right - but that seems like exactly the reverse of the issue with the current
approach, which is that the guest can temporarily use less memory than desired.

> Even then the guest will logically
> make use of more memory than desired until core MM takes pages away.

That sounds more like a host issue though. If it wants to,
it can use e.g. MADV_DONTNEED.

> So:
> 1) Unmodified guests will make use of more memory than desired.

One interesting possibility for this is to add the buffer memory
by hotplug after the feature has been negotiated.
I agree this sounds complex.

But I have an idea: how about we include the hint size in the
num_pages counter? Then unmodified guests put
it in the balloon and don't use it. Modified ones
will know to use it just for hinting.
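
To make the num_pages idea concrete, here is a rough sketch in C. It is
only a model, not the virtio-balloon ABI: the struct layouts, the
hint_pages field and the plan_pages() helper are all made up for
illustration, and how a modified guest would actually learn the hint size
is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: the host publishes one target that already
 * includes the hint buffer size. Not the real device config layout. */
struct balloon_target {
    uint32_t num_pages;   /* total pages the host asks the guest to give up */
    uint32_t hint_pages;  /* part of num_pages intended as hint buffer */
};

struct guest_plan {
    uint32_t inflate_pages;  /* pages put in the balloon and never used */
    uint32_t hint_buffer;    /* pages set aside for hinting only */
};

static struct guest_plan plan_pages(struct balloon_target t,
                                    bool hinting_negotiated)
{
    /* Unmodified guest: everything in num_pages goes into the balloon. */
    struct guest_plan p = { .inflate_pages = t.num_pages, .hint_buffer = 0 };

    if (hinting_negotiated) {
        /* Modified guest: carve the hint buffer out of num_pages. */
        uint32_t hint = t.hint_pages < t.num_pages ? t.hint_pages
                                                   : t.num_pages;
        p.inflate_pages = t.num_pages - hint;
        p.hint_buffer = hint;
    }

    /* Either way the guest gives up exactly num_pages. */
    assert(p.inflate_pages + p.hint_buffer == t.num_pages);
    return p;
}
```

The point is that the host-visible total is the same in both cases, so an
unmodified guest never gets to use the extra memory.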

> 2) Malicious guests will make use of more memory than desired.

Well this limitation is fundamental to balloon right?
If host wants to add tracking of balloon memory, it
can enforce the limits. So far no one bothered,
but maybe with this feature we should start to do that.

> 3) Sane, modified guests will make use of more memory than desired.
>
> Instead, we could make our life much easier by doing the following:
> 
> 1) Introduce a parameter to cap the amount of memory concurrently hinted
> similar to what you suggested, just don't consider it a buffer value.
> "-device virtio-balloon,hinting_size=1G". This gives us control over the
> hinting process.
> 
> hinting_size=0 (default) disables hinting
> 
> The admin can tweak the number along with memory requirements of the
> guest. We can make suggestions (e.g. calculate depending on #cores,#size
> of memory, or simply "1GB")

So if it's all up to the guest and for the benefit of the guest, and
with no cost/benefit to the host, then why are we supplying this value
from the host?

> 2) In the guest, track the size of hints in progress, cap at the
> hinting_size.
> 
> 3) Document hinting behavior
> 
> "When hinting is enabled, memory up to hinting_size might temporarily be
> removed from your guest in order to be hinted to the hypervisor. This is
> only for a very short time, but might affect applications. Consider the
> hinting_size when sizing your guest. If your application was tested with
> XGB and a hinting size of 1G is used, please configure X+1GB for the
> guest. Otherwise, performance degradation might be possible."

OK, so let's start with this. Now let us assume that guest follows
the advice.  We thus know that 1GB is not needed for guest applications.
So why do we want to allow applications to still use this extra memory?

> 4) Do the loop/yield on OOM as discussed to improve performance when OOM
> and avoid false OOM triggers just to be sure.

Yes, I'm not against trying the simpler approach as a first step.  But
then we need this path actually tested to see whether hinting introduces
unreasonable overhead on this path.  And it is tricky to test OOM as you
are skating close to the system's limits. That's one reason I prefer
avoiding the OOM handler if possible.

When you say yield, I would guess that would involve config space access
to the balloon to flush out outstanding hints?
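
For the sake of discussion, this is roughly how I read steps 2 and 4: a
counter of pages with hints in flight, capped at hinting_size, which the
OOM path checks before doing anything drastic. This is a userspace sketch
using C11 atomics, not kernel code; hint_begin(), hint_done() and
oom_should_wait() are invented names, and the actual wait/yield/flush
mechanism is exactly the part that is still open.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Pages currently handed to the hypervisor for hinting. */
static atomic_size_t hints_in_flight;

/* Reserve nr pages for hinting; refuse if that would exceed the cap. */
static bool hint_begin(size_t nr, size_t cap)
{
    size_t cur = atomic_load(&hints_in_flight);

    do {
        if (nr > cap || cur > cap - nr)
            return false;
        /* On failure the CAS reloads cur, and the cap is re-checked. */
    } while (!atomic_compare_exchange_weak(&hints_in_flight, &cur,
                                           cur + nr));
    return true;
}

/* The host processed a hint: nr pages are usable by the guest again. */
static void hint_done(size_t nr)
{
    size_t prev = atomic_fetch_sub(&hints_in_flight, nr);

    assert(prev >= nr);
    (void)prev;
}

/* OOM path: true means "wait or yield and retry" instead of killing. */
static bool oom_should_wait(void)
{
    return atomic_load(&hints_in_flight) > 0;
}
```

Note that nothing here bounds how long oom_should_wait() keeps returning
true, which is the overhead I'd want to see measured.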

> 
> BTW, one alternative I initially had in mind was to add pages from the
> buffer from the OOM handler only and put these pages back into the
> buffer once freed.

I don't think that works easily - pages get used so we can't
return them into the buffer. Another problem with only handling oom
is that oom is a guest decision. So host really can't
enforce any limits even if it wants to.

> I thought this might help for certain memory offline
> scenarios where pages stuck in the buffer might hinder offlining of
> memory. And of course, improve performance as the buffer is only touched
> when really needed. But it would only help for memory (e.g. DIMM) added
> after boot, so it is also not 100% safe. Also, same issues as with your
> given approach.

So you can look at this approach as a combination of
- balloon inflate with separate accounting
- deflate on oom
- hinting
?

Put this way, it seems rather uncontroversial, right?

> -- 
> 
> Thanks,
> 
> David / dhildenb