On Friday, June 29, 2018 7:54 PM, David Hildenbrand wrote:
> On 29.06.2018 13:31, Wei Wang wrote:
> > On 06/29/2018 03:46 PM, David Hildenbrand wrote:
> >>>
> >>> I'm afraid it can't. For example, when we have a guest booted,
> >>> without too many memory activities. Assume the guest has 8GB free
> >>> memory. The arch_free_page there won't be able to capture the 8GB
> >>> free pages since there is no free() called. This results in no
> >>> free pages reported to host.
> >>
> >> So, it takes some time from when the guest boots up until the balloon
> >> device was initialized and therefore page hinting can start. For that
> >> period, you won't get any arch_free_page()/page hinting callbacks, correct.
> >>
> >> However in the hypervisor, you can theoretically track which pages
> >> the guest actually touched ("dirty"), so you already know "which
> >> pages were never touched while booting up until virtio-balloon was
> >> brought to life". These, you can directly exclude from migration. No
> >> interface required.
> >>
> >> The remaining problem is pages that were touched ("allocated") by the
> >> guest during bootup but freed again, before virtio-balloon came up.
> >> One would have to measure how many pages these usually are, I would
> >> say it would not be that many (because recently freed pages are
> >> likely to be used again next for allocation). However, there are some
> >> pages not being reported.
> >>
> >> During the lifetime of the guest, this should not be a problem,
> >> eventually one of these pages would get allocated/freed again, so the
> >> problem "solves itself over time". You are looking into the special
> >> case of migrating the VM just after it has been started. But we have
> >> the exact same problem also for ordinary free page hinting, so we
> >> should rather solve that problem. It is not migration specific.
> >>
> >> If we are looking for an alternative to "problem solves itself",
> >> something like "if virtio-balloon comes up, it will report all free
> >> pages step by step using free page hinting, just like we would have
> >> from "arch_free_pages()"". This would be the same interface we are
> >> using for free page hinting - and it could even be made configurable
> >> in the guest.
> >>
> >> The current approach we are discussing internally for details about
> >> Nitesh's work ("how the magic inside arch_free_pages() will work
> >> efficiently") would allow this as far as I can see just fine.
> >>
> >> There would be a tiny little window between when virtio-balloon comes
> >> up and when it has reported all free pages step by step, but that can
> >> be considered a very special corner case that I would argue is not
> >> worth optimizing.
> >>
> >> If I am missing something important here, sorry in advance :)
> >>
> >
> > Probably I didn't explain that well. Please see my re-try:
> >
> > That work is to monitor page allocation and free activities via
> > arch_alloc_pages and arch_free_pages. It has per-CPU lists to record
> > the pages that are freed to the mm free list, and the per-CPU lists
> > dump the recorded pages to a global list when any of them is full.
> > So its own per-CPU list will only be able to get free pages when
> > an mm free() function gets called. Suppose we have 8GB free memory on
> > the mm free list, but no application uses it and thus no mm free()
> > calls are made. In that case, arch_free_pages isn't called and no
> > free pages are added to the per-CPU list, but we have 8GB free memory
> > right on the mm free list.
> > How would you guarantee the per-CPU lists have got all the free pages
> > that the mm free lists have?
>
> As I said, if we have some mechanism that will scan the free pages (not
> arch_free_page()) once and report hints using the same mechanism step by
> step (not your bulk interface), this problem is solved.
> And as I said, this is
> not a migration specific problem, we have the same problem in the current
> page hinting RFC. These pages have to be reported.
>
> >
> > - I'm also worried about the overhead of maintaining so many per-CPU
> > lists and the global list. For example, if we have applications
> > frequently allocate and free 4KB pages, and each per-CPU list needs to
> > implement the buddy algorithm to sort and merge neighbor pages. Today
> > a server can have more than 100 CPUs, then there will be more than 100
> > per-CPU lists which need to sync to a global list under a lock, I'm
> > not sure if this would scale well.
>
> The overhead in the current RFC is definitely too high. But I consider this a
> problem to be solved before page hinting would go upstream. And we are
> discussing right now "if we have a reasonable page hinting implementation,
> why would we need your interface in addition".
>
> >
> > - This seems to be a burden imposed on the core mm memory
> > allocation/free path. The whole overhead needs to be carried during
> > the whole system life cycle. What we actually expected is to just make
> > one call to get the free page hints only when live migration happens.
>
> You're focusing too much on the actual implementation of the page hinting
> RFC right now. Assume for now that we would have
> - efficient page hinting without degrading other CPUs and little
>   overhead
> - a mechanism that solves reporting free pages once after we started up
>   virtio-balloon and actual free page hinting starts
>
> Why would your suggestion still be applicable?
>
> Your point for now is "I might not want to have page hinting enabled due to
> the overhead, but still a live migration speedup". If that overhead actually
> exists (we'll have to see) or there might be another reason to disable page
> hinting, then we have to decide if that specific setup is worth it merging your
> changes.

All of the above "if we have" and "assume we have" statements don't sound
like valid arguments to me.

> I am not (and don't want to be) in the position to make any decisions here :) I
> just want to understand if two interfaces for free pages actually make sense.

I responded to Nitesh about the differences; you may want to check with
him about this.
I would suggest that you send your patches to LKML to start a discussion
with the mm folks.

Best,
Wei
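
For readers following the thread, the disagreement above can be illustrated
with a toy user-space model in Python. This is only a sketch, not kernel
code and not the page hinting RFC: the class HintTracker, the function
seed_from_free_list(), the capacity of 4, and the 16-page guest are all
made up for illustration. It models hints that are recorded only when a
free hook fires (the concern about the 8GB that is never freed), and the
one-time scan of the existing free list through the same hint path
(the suggested fix).

```python
# Toy model only: hints are recorded solely when a "free" hook fires.
PER_CPU_CAPACITY = 4  # hypothetical per-CPU list size


class HintTracker:
    """Hypothetical stand-in for per-CPU hint lists plus one global list."""

    def __init__(self, num_cpus, capacity=PER_CPU_CAPACITY):
        self.capacity = capacity
        self.per_cpu = [[] for _ in range(num_cpus)]
        self.global_list = []

    def on_free(self, cpu, pfn):
        # Stand-in for an arch_free_page()-style hook: record one freed page.
        self.per_cpu[cpu].append(pfn)
        if len(self.per_cpu[cpu]) >= self.capacity:
            self.flush(cpu)

    def flush(self, cpu):
        # A full per-CPU list is dumped to the global list.
        self.global_list.extend(self.per_cpu[cpu])
        self.per_cpu[cpu].clear()

    def hinted(self):
        # Every page currently known to the hinting machinery.
        pages = set(self.global_list)
        for cpu_list in self.per_cpu:
            pages.update(cpu_list)
        return pages


def seed_from_free_list(tracker, free_pfns, cpu=0):
    # One-time scan: feed already-free pages through the same hint path.
    for pfn in free_pfns:
        tracker.on_free(cpu, pfn)


# The guest boots with pages 0..15 free, and nothing has called free() yet.
boot_free = set(range(16))
tracker = HintTracker(num_cpus=2)
missed_before = boot_free - tracker.hinted()  # nothing has been hinted

seed_from_free_list(tracker, sorted(boot_free))
missed_after = boot_free - tracker.hinted()   # the scan covers the gap
```

In this model every boot-time free page is missed until the one-time scan
runs, after which none are. What the model cannot show is the cost
question raised above: how expensive the per-CPU bookkeeping and the
locked sync to the global list are on a machine with 100+ CPUs.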