On 02.04.19 17:04, Alexander Duyck wrote:
> On Tue, Apr 2, 2019 at 12:42 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> On 01.04.19 22:56, Alexander Duyck wrote:
>>> On Mon, Apr 1, 2019 at 7:47 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Apr 01, 2019 at 04:11:42PM +0200, David Hildenbrand wrote:
>>>>>> The interesting thing is most probably: Will the hinting size usually be
>>>>>> reasonable small? At least I guess a guest with 4TB of RAM will not
>>>>>> suddenly get a hinting size of hundreds of GB. Most probably also only
>>>>>> something in the range of 1GB. But this is an interesting question to
>>>>>> look into.
>>>>>>
>>>>>> Also, if the admin does not care about performance implications when
>>>>>> already close to hinting, no need to add the additional 1Gb to the ram size.
>>>>>
>>>>> "close to OOM" is what I meant.
>>>>
>>>> Problem is, host admin is the one adding memory. Guest admin is
>>>> the one that knows about performance.
>>>
>>> The thing we have to keep in mind with this is that we are not dealing
>>> with the same behavior as the balloon driver. We don't need to inflate
>>> a massive hint and hand that off. Instead we can focus on performing
>>> the hints on much smaller amounts and do it incrementally over time
>>> with the idea being as the system sits idle it frees up more and more
>>> of the inactive memory on the system.
>>>
>>> With that said, I still don't like the idea of us even trying to
>>> target 1GB of RAM for hinting. I think it would be much better if we
>>> stuck to smaller sizes and kept things down to a single digit multiple
>>> of THP or higher order pages. Maybe something like 64MB of total
>>> memory out for hinting.
>>
>> 1GB was just a number I came up with. But please note, as VCPUs hint in
>> parallel, even though each request is only 64MB in size, things can sum up.
>
> Why do we need them running in parallel for a single guest? I don't
> think we need the hints so quickly that we would need to have multiple
> VCPUs running in parallel to provide hints. In addition as it
> currently stands in order to get pages into and out of the buddy
> allocator we are going to have to take the zone lock anyway so we
> could probably just assume a single thread for pulling the memory,
> placing it on the ring, and putting it back into the buddy allocator
> after the hint has been completed.
>
>>>
>>> All we really would need to make it work would be to possibly look at
>>> seeing if we can combine PageType values. Specifically what I would be
>>> looking at is a transition that looks something like Buddy -> Offline
>>> -> (Buddy | Offline). We would have to hold the zone lock at each
>>> transition, but that shouldn't be too big of an issue. If we are okay
>>> with possibly combining the Offline and Buddy types we would have a
>>> way of tracking which pages have been hinted and which have not. Then
>>> we would just have to have a thread running in the background on the
>>> guest that is looking at the higher order pages and pulling 64MB at a
>>> time offline, and when the hinting is done put them back in the "Buddy
>>> | Offline" state.
>>
>> That approach may have other issues to solve (1 thread vs. many VCPUs,
>> scanning all buddy pages over and over again) and other implications
>> that might be undesirable (hints performed even more delayed, additional
>> thread activity). I wouldn't call it the ultimate solution.
>
> So the problem with trying to provide the hint sooner is that you end
> up creating a bottle-neck or you end up missing hints on pages
> entirely and then have to fall back to such an approach. By just
> letting the thread run in the background reporting the idle memory we
> can avoid much of that.

BTW, what you propose was already suggested in a similar form by Wei
some (weeks? months?) ago. Back then I thought about something like an
"escalation" mode. If too much MM activity is going on (e.g. close to
OOM, dropping hints, whatever), temporarily stop ordinary hinting and do
basically what you describe. But I am not sure if dropping hints is
actually still a problem.

--

Thanks,

David / dhildenb
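
For reference, the Buddy -> Offline -> (Buddy | Offline) transition discussed in the thread can be modelled in user space roughly as follows. This is only a toy sketch, not kernel code: the `Zone` class, `hint_pass()`, and the 2MB chunk size are hypothetical names and values chosen for illustration, a plain Python lock stands in for the zone lock, and "hinting" is reduced to appending to a list. The 64MB batch cap is the figure suggested in the thread.

```python
import threading
from enum import Flag


class PageType(Flag):
    NONE = 0      # allocated, not in the free lists
    BUDDY = 1     # free, not yet hinted
    OFFLINE = 2   # pulled out of the free lists for hinting
    # BUDDY | OFFLINE marks a free chunk that has already been hinted


CHUNK_MB = 2    # assumed size of one higher-order chunk (THP-sized)
BATCH_MB = 64   # at most this much memory is out for hinting at once


class Zone:
    def __init__(self, n_chunks):
        self.lock = threading.Lock()          # stands in for the zone lock
        self.state = [PageType.BUDDY] * n_chunks
        self.reported = []                    # chunks "reported to the host"

    def pull_batch(self):
        """Buddy -> Offline for up to BATCH_MB of unhinted free chunks."""
        batch = []
        with self.lock:
            for i, s in enumerate(self.state):
                if len(batch) * CHUNK_MB >= BATCH_MB:
                    break
                if s == PageType.BUDDY:       # skips hinted and allocated chunks
                    self.state[i] = PageType.OFFLINE
                    batch.append(i)
        return batch

    def hint(self, batch):
        """Placeholder for placing the batch on the ring; no lock is held."""
        self.reported.extend(batch)

    def return_batch(self, batch):
        """Offline -> (Buddy | Offline): free again and marked as hinted."""
        with self.lock:
            for i in batch:
                self.state[i] = PageType.BUDDY | PageType.OFFLINE

    def alloc(self):
        """Allocate the first free chunk; offline chunks are unavailable."""
        with self.lock:
            for i, s in enumerate(self.state):
                if PageType.BUDDY in s:
                    self.state[i] = PageType.NONE
                    return i
        return None

    def free(self, i):
        """A freed chunk returns as plain Buddy, so it gets hinted again."""
        with self.lock:
            self.state[i] = PageType.BUDDY


def hint_pass(zone):
    """One pass of the single background thread; returns MB hinted."""
    total_mb = 0
    while True:
        batch = zone.pull_batch()
        if not batch:
            return total_mb
        zone.hint(batch)
        zone.return_batch(batch)
        total_mb += len(batch) * CHUNK_MB
```

Even in this toy model, `pull_batch()` rescans every chunk on each call, which is the "scanning all buddy pages over and over again" cost raised above; a real implementation would presumably need some way to avoid it.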