On 29.03.19 17:51, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2019 at 04:45:58PM +0100, David Hildenbrand wrote:
>> On 29.03.19 16:37, David Hildenbrand wrote:
>>> On 29.03.19 16:08, Michael S. Tsirkin wrote:
>>>> On Fri, Mar 29, 2019 at 03:24:24PM +0100, David Hildenbrand wrote:
>>>>>
>>>>> We had a very simple idea in mind: As long as a hinting request is
>>>>> pending, don't actually trigger any OOM activity, but wait for it
>>>>> to be processed. This can be done using a simple atomic variable.
>>>>>
>>>>> This is a scenario that will only pop up when already pretty low
>>>>> on memory. And the main difference to ballooning is that we *know*
>>>>> we will get more memory soon.
>>>>
>>>> No we don't. If we keep polling we are quite possibly keeping the
>>>> CPU busy, delaying the hint request processing. Again, the issue is
>>>
>>> You can always yield. But that's a different topic.
>>>
>>>> a tradeoff. One performance for the other. Very hard to know which
>>>> path you hit in advance, and in the real world no one has the time
>>>> to profile and tune things. By comparison, trading memory for
>>>> performance is well understood.
>>>>
>>>>
>>>>> "appended to guest memory", "global list of memory", malicious
>>>>> guests always using that memory ... like, what about NUMA?
>>>>
>>>> This can be up to the guest. A good approach would be to take
>>>> a chunk out of each node and add it to the hints buffer.
>>>
>>> This might lead to you not using the buffer efficiently. But also,
>>> different topic.
>>>
>>>>
>>>>> What about different page granularity?
>>>>
>>>> Seems like an orthogonal issue to me.
>>>
>>> It is similar, yes. But if you support multiple granularities (e.g.
>>> MAX_ORDER - 1, MAX_ORDER - 2, ...) you might have to implement some
>>> sort of buddy allocator for the buffer. This is different than just
>>> a list for each node.
>
> Right, but we don't plan to do it yet.
MAX_ORDER - 2 on x86-64 seems to work just fine (no THP splits), and
early performance numbers indicate it might be the right thing to do.
So it could be very desirable once we do more performance tests.

>
>> Oh, and before I forget, different zones might of course also be a
>> problem.
>
> I would just split the hint buffer evenly between zones.
>

Thinking about your approach, there is one elementary thing to notice:
Giving the guest pages from the buffer while hinting requests are being
processed means that the guest can and will temporarily make use of
more memory than desired. Essentially up to the point where MADV_FREE
is finally called for the hinted pages. Even then, the guest will
logically make use of more memory than desired until core MM takes the
pages away.

So:
1) Unmodified guests will make use of more memory than desired.
2) Malicious guests will make use of more memory than desired.
3) Sane, modified guests will make use of more memory than desired.

Instead, we could make our life much easier by doing the following:

1) Introduce a parameter to cap the amount of memory concurrently
   hinted, similar to what you suggested, just don't consider it a
   buffer value. "-device virtio-balloon,hinting_size=1G". This gives
   us control over the hinting process. hinting_size=0 (default)
   disables hinting.

   The admin can tweak the number along with the memory requirements of
   the guest. We can make suggestions (e.g. calculate depending on
   #cores, size of memory, or simply "1GB").

2) In the guest, track the size of hints in progress and cap it at
   hinting_size.

3) Document the hinting behavior: "When hinting is enabled, memory up
   to hinting_size might temporarily be removed from your guest in
   order to be hinted to the hypervisor. This is only for a very short
   time, but might affect applications. Consider the hinting_size when
   sizing your guest. If your application was tested with X GB and a
   hinting_size of 1G is used, please configure X+1 GB for the guest.
   Otherwise, performance degradation might be possible."

4) Do the loop/yield on OOM as discussed, to improve performance when
   OOM and to avoid false OOM triggers, just to be sure.

BTW, one alternative I initially had in mind was to add pages from the
buffer from the OOM handler only, and to put these pages back into the
buffer once freed. I thought this might help for certain memory offline
scenarios, where pages stuck in the buffer might hinder offlining of
memory. And of course, it would improve performance, as the buffer is
only touched when really needed. But it would only help for memory
(e.g. a DIMM) added after boot, so it is also not 100% safe. Also, it
has the same issues as your given approach.

-- 

Thanks,

David / dhildenb