On 29.03.19 17:51, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
>> On 29.03.19 16:37, David Hildenbrand wrote:
>>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>>>>
>>>>> We had a very simple idea in mind: As long as a hinting request is
>>>>> pending, don't actually trigger any OOM activity, but wait for it
>>>>> to be processed. This can be done using a simple atomic variable.
>>>>>
>>>>> This is a scenario that will only pop up when already pretty low
>>>>> on memory. And the main difference to ballooning is that we *know*
>>>>> we will get more memory soon.
>>>>
>>>> No we don't. If we keep polling we are quite possibly keeping the
>>>> CPU busy, delaying the hint request processing. Again, the issue is
>>>
>>> You can always yield. But that's a different topic.
>>>
>>>> a tradeoff. One performance for the other. Very hard to know which
>>>> path you hit in advance, and in the real world no one has the time
>>>> to profile and tune things. By comparison, trading memory for
>>>> performance is well understood.
>>>>
>>>>
>>>>> "appended to guest memory", "global list of memory", malicious
>>>>> guests always using that memory ... like, what about NUMA?
>>>>
>>>> This can be up to the guest. A good approach would be to take
>>>> a chunk out of each node and add it to the hints buffer.
>>>
>>> This might lead to you not using the buffer efficiently. But also,
>>> different topic.
>>>
>>>>
>>>>> What about different page granularity?
>>>>
>>>> Seems like an orthogonal issue to me.
>>>
>>> It is similar, yes. But if you support multiple granularities (e.g.
>>> MAX_ORDER - 1, MAX_ORDER - 2, ...) you might have to implement some
>>> sort of buddy allocator for the buffer. This is different than just
>>> a list for each node.
>
> Right, but we don't plan to do it yet.
MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits), and
early performance numbers indicate it might be the right thing to do.
So it could be very desirable once we do more performance tests.

>
>> Oh, and before I forget, different zones might of course also be a
>> problem.
>
> I would just split the hint buffer evenly between zones.
>

Thinking about your approach, there is one elementary thing to notice:
Giving the guest pages from the buffer while hinting requests are being
processed means that the guest can and will temporarily make use of
more memory than desired. Essentially up to the point where MADV_FREE
is finally called for the hinted pages. Even then, the guest will
logically make use of more memory than desired until core MM takes the
pages away.

So:
1) Unmodified guests will make use of more memory than desired.
2) Malicious guests will make use of more memory than desired.
3) Sane, modified guests will make use of more memory than desired.

Instead, we could make our life much easier by doing the following:

1) Introduce a parameter to cap the amount of memory concurrently
   hinted, similar to what you suggested, just don't consider it a
   buffer value. "-device virtio-balloon,hinting_size=1G". This gives
   us control over the hinting process. hinting_size=0 (default)
   disables hinting.

   The admin can tweak the number along with the memory requirements of
   the guest. We can make suggestions (e.g. calculate depending on
   #cores, size of memory, or simply "1GB").

2) In the guest, track the size of hints in progress and cap it at
   hinting_size.

3) Document the hinting behavior: "When hinting is enabled, memory up
   to hinting_size might temporarily be removed from your guest in
   order to be hinted to the hypervisor. This is only for a very short
   time, but might affect applications. Consider the hinting_size when
   sizing your guest. If your application was tested with X GB and a
   hinting_size of 1G is used, please configure X+1 GB for the guest.
   Otherwise, performance degradation might be possible."

4) Do the loop/yield on OOM as discussed, to improve performance when
   OOM and to avoid false OOM triggers, just to be sure.

BTW, one alternative I initially had in mind was to add pages from the
buffer from the OOM handler only, and to put these pages back into the
buffer once freed. I thought this might help for certain memory offline
scenarios, where pages stuck in the buffer might hinder offlining of
memory. And of course, it would improve performance, as the buffer is
only touched when really needed. But it would only help for memory
(e.g. a DIMM) added after boot, so it is also not 100% safe. Also, it
has the same issues as your given approach.

-- 

Thanks,

David / dhildenb