On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/13/19 4:19 AM, David Hildenbrand wrote:
> > On 13.02.19 09:55, Wang, Wei W wrote:
> >> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >>> Global means all VCPUs will be competing potentially for a single lock when
> >>> freeing/allocating a page, no? What if you have 64 VCPUs allocating/freeing
> >>> memory like crazy?
> >> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn't it?
> >> I think atomic operations on the bitmap would be enough.
> > If you have to resize/alloc/coordinate who will report, you will need
> > locking. Especially, I doubt that there is an atomic xbitmap (prove me
> > wrong :) ).
> > 
> >>> (I assume some kind of locking is required even if the bitmap would be
> >>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >>> memory at places where we don't want to - e.g. from arch_free_page ?)
> >> arch_free_pages is in free_pages_prepare, why can't we have memory allocation there?
> > I remember we were stumbling over some issues that were non-trivial. I
> > am not 100% sure anymore, but allocating memory while deep down in
> > the freeing part of MM core smells like "be careful".
> > 
> >> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> > That makes things very ugly. Especially, preallocation will most likely
> > require locking.
> > 
> >>> That's the big benefit of taking the pages off the buddy free list. Other VCPUs
> >>> won't stumble over them, waiting for them to get freed in the hypervisor.
> >> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> > If a page is in the buddy but stuck in some other bitmap, there is
> > nothing stopping another VCPU from trying to allocate it. Nitesh has
> > been fighting with this problem already :)
> > 
> >>> This sounds more like "the host requests to get free pages once in a while"
> >>> compared to "the host is always informed about free pages". At the time
> >>> where the host actually has to ask the guest (e.g. because the host is low on
> >>> memory), it might be too late to wait for guest action.
> >> Option 1: Host asks for free pages:
> >> It is not necessary to ask only when the host is under memory pressure.
> >> This could be the orchestration layer's job to monitor the host memory usage.
> >> For example, people could set the condition "when 50% of the host memory
> >> has been used, start to ask a guest for some amount of free pages"
> >> 
> >> Option 2: Guest actively offers free pages:
> >> Add a balloon callback to arch_free_page so that whenever a page gets freed its gfn
> >> will be filled into the balloon's report_vq and the host will take away the backing
> >> host page.
> >> 
> >> Both options can be implemented. But I think option 1 would be more
> >> efficient as the guest free pages are offered on demand.
> > Yes, but as I mentioned this has other drawbacks. Relying on a guest
> > to free up memory when you really need it is not going to work. It might
> > work for some scenarios but should not dictate the design. It is a good
> > start though if it makes things easier.
> > 
> > Enabling/disabling free page hinting by the hypervisor via some
> > mechanism is on the other hand a good idea. "I have plenty of free
> > space, don't worry".
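To make the "atomic bitmap" idea above a bit more concrete, an
arch_free_page hook along the lines of Wei's option 2 could look roughly
like the sketch below. All names here (page_hint_bitmap,
page_hint_enabled) are made up for illustration, this is not from
Nitesh's series, and it deliberately ignores who allocates and resizes
the bitmap - which is exactly the part David is worried about.

#include <linux/bitops.h>
#include <linux/mm.h>

/* Illustrative globals; a real implementation would own these in the
 * virtio-balloon driver and still has to solve allocation/resizing. */
static unsigned long *page_hint_bitmap;	/* one bit per guest pfn */
static bool page_hint_enabled;

void arch_free_page(struct page *page, int order)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long i;

	if (!page_hint_enabled)
		return;

	/* set_bit() is atomic per bit, so no global lock is taken here,
	 * but reporting and resizing still need their own coordination. */
	for (i = 0; i < (1UL << order); i++)
		set_bit(pfn + i, page_hint_bitmap);
}

Even with the per-bit atomicity, the open question is what stops another
VCPU from allocating a page while it is still marked in that bitmap.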
> > 
> >>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>> candidates for removal and, if the host is low on memory, only scanning the
> >>> guest page tables is sufficient to free up memory.
> >>> 
> >>> But both points might just be an implementation detail in the example you
> >>> describe.
> >> Yes, it is an implementation detail. I think DONTNEED would be easier
> >> for the first step.
> >> 
> >>>> In above 2), get_free_page_hints clears the bits which indicate that those
> >>>> pages are not ready to be used by the guest yet. Why?
> >>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>>> Normally, when the guest re-visits those pages, EPT violations and QEMU page
> >>>> faults will get a new host page to set up the related EPT entry. If the guest uses
> >>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>>> violation happens and the guest will use the same physical page that will be
> >>>> unmapped and given to other host threads. So we need to make sure that
> >>>> the guest free page is usable only after step 3 finishes.
> >>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>> have "1" set in the bitmap; if that's true, just clear the bits. Otherwise, it
> >>>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>>> case, we can either have arch_alloc_page() busy-waiting a bit till 4) is done
> >>>> for that page, or better, have a balloon callback which prioritizes 3) and 4)
> >>>> to make this page usable by the guest.
> >>> 
> >>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>> page (along with other pages) is just being freed by the hypervisor.
> >>> It has to busy-wait, no chance to prioritize.
> >> I meant this:
> >> With this approach, essentially the free pages have 2 states:
> >> ready free page: the page is on the free list and it has "1" in the bitmap
> >> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >> Ready free pages are those which can be allocated for use.
> >> Non-ready free pages are those which are in the process of being reported to
> >> the host and whose EPT mapping is about to be zapped.
> >> 
> >> The non-ready pages are inserted into the report_vq and wait for the
> >> host to zap the mappings one by one. After the mapping gets zapped
> >> (which means the backing host page has been taken away), the host acks to
> >> the guest to mark the free page as a ready free page (set the bit to 1 in the bitmap).
> > Yes, that's how I understood your approach. The interesting part is
> > where somebody finds a buddy page and wants to allocate it.
> > 
> >> So the non-ready free pages may happen to be used while they are waiting in
> >> the report_vq to be handled by the host to zap the mapping, so the balloon could
> >> have a fast path to notify the host:
> >> "page 0x1000 is about to be used, don't zap the mapping when you get
> >> 0x1000 from the report_vq" /* option [1] */
> > This requires coordination and in any case there will be a scenario
> > where you have to wait for the hypervisor to eventually finish a madv
> > call. You can just try to make that scenario less likely.
> > 
> > What you propose is synchronous in the worst case. Getting pages off the
> > buddy makes it possible to have it done completely asynchronously. Nobody
> > allocating a page has to wait.
> > 
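And to pin down the "ready"/"non-ready" states Wei describes, the
allocation side would roughly have to do something like the sketch below
(again purely illustrative, reusing the made-up page_hint_bitmap from the
sketch above). Note that it shows exactly the busy-wait David objects to:

#include <linux/bitops.h>
#include <linux/mm.h>
#include <asm/processor.h>	/* cpu_relax() */

extern unsigned long *page_hint_bitmap;
extern bool page_hint_enabled;

void arch_alloc_page(struct page *page, int order)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long i;

	if (!page_hint_enabled)
		return;

	for (i = 0; i < (1UL << order); i++) {
		/*
		 * "1" == ready free page: consume the hint and move on.
		 * "0" == still queued on the report_vq; spin until the
		 * host acks the zap and sets the bit back to "1".
		 */
		while (!test_and_clear_bit(pfn + i, page_hint_bitmap))
			cpu_relax();
	}
}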
> >> Or
> >> 
> >> "page 0x1000 is about to be used, please zap the mapping NOW, i.e.
> >> do 3) and 4) above,
> >> so that the free page will be marked as a ready free page and the guest can use it".
> >> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >> page to back the guest ready free page.
> > Again, coordination with the hypervisor while allocating a page. That is
> > to be avoided in any case.
> > 
> >>>> Using bitmaps to record free page hints doesn't need to take the free pages
> >>>> off the buddy list and return them later, which needs to go through the long
> >>>> allocation/free code path.
> >>> Yes, but it means that any process is able to get stuck on such a page for as
> >>> long as it takes to report the free pages to the hypervisor and for it to call
> >>> madvise(pfn_start, DONTNEED) on any such page.
> >> This only happens when the guest thread happens to get allocated on a page which is
> >> being reported to the host. Using option [1] above will avoid this.
> > I think getting pages out of the buddy system temporarily is the only
> > way we can avoid somebody else stumbling over a page currently getting
> > reported by the hypervisor. Otherwise, as I said, there are scenarios
> > where an allocating VCPU has to wait for the hypervisor to finish the
> > "freeing" task. While you can try to "speed up" that scenario -
> > "hypervisor please prioritize" - you cannot avoid it. There will be busy
> > waiting.
> > 
> > I don't believe what you describe is going to work (especially the not
> > locking part when working with global resources).
> > 
> > What would be interesting is to see if something like an xbitmap could be
> > used instead of the per-vcpu list.
> Yeap, exactly.
> > Nitesh, do you remember what the
> > problem was with allocating memory from these hooks? Was it a locking issue?
> In the previous implementation, the issue was due to the locking. In the
> current implementation, having an allocation under these hooks will
> result in lots of isolation failures under memory pressure.

But then we shouldn't be giving host memory when under pressure at all,
should we?

> By the above statement, if you are referring to having a dynamic array
> to hold the freed pages,
> then that is an idea Andrea also suggested to get around this fixed
> array size issue.
> > 
> > Thanks!
> > 
> >> Best,
> >> Wei
> >> 
> > 
> -- 
> Regards
> Nitesh
> 
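One more thing, for completeness: the host-side "zap the mapping" step
being argued about above is essentially just a madvise() on the hva
backing the reported range, roughly as sketched below. gpa_to_hva() and
ack_free_page() are stand-ins for whatever the hypervisor actually does
to resolve the guest address and ack the guest; they are not existing
QEMU functions.

#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in prototypes; implementation-specific. */
void *gpa_to_hva(uint64_t gpa);
void ack_free_page(uint64_t gpa, size_t len);

/* Hypothetical sketch of steps 3) and 4): drop the host memory backing a
 * guest-reported free range, then ack so the guest marks it ready again. */
static void discard_reported_range(uint64_t gpa, size_t len)
{
	void *hva = gpa_to_hva(gpa);

	/*
	 * MADV_DONTNEED discards the pages right away; MADV_FREE (what
	 * Nitesh uses, as David recalls) only marks them reclaimable
	 * under host memory pressure.
	 */
	if (madvise(hva, len, MADV_DONTNEED) == 0)
		ack_free_page(gpa, len);
}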