> On Mar 31, 2020, at 7:09 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
> 
> On 31.03.20 16:07, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 04:03:18PM +0200, David Hildenbrand wrote:
>>> On 31.03.20 15:37, Michael S. Tsirkin wrote:
>>>> On Tue, Mar 31, 2020 at 03:32:05PM +0200, David Hildenbrand wrote:
>>>>> On 31.03.20 15:24, Michael S. Tsirkin wrote:
>>>>>> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>>>>>>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>>>>>>>>>> On 26.03.2020 at 08:21, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
>>>>>>>>>>>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
>>>>>>>>>>>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
>>>>>>>>>>>>> 2. You are essentially stealing THPs in the guest. So the fastest
>>>>>>>>>>>>> mapping (THP in guest and host) is gone. The guest won't be able to make
>>>>>>>>>>>>> use of THP where it previously was able to. I can imagine this implies a
>>>>>>>>>>>>> performance degradation for some workloads. This needs a proper
>>>>>>>>>>>>> performance evaluation.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the problem is more with the alloc_pages API.
>>>>>>>>>>>> That gives you exactly the given order, and if there's
>>>>>>>>>>>> a larger chunk available, it will split it up.
>>>>>>>>>>>> 
>>>>>>>>>>>> But for balloon - and I suspect lots of other users -
>>>>>>>>>>>> we do not want to stress the system, but if a large
>>>>>>>>>>>> chunk is available anyway, then we could handle
>>>>>>>>>>>> that more optimally by getting it all in one go.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> So if we want to address this, IMHO this calls for a new API.
>>>>>>>>>>>> Along the lines of
>>>>>>>>>>>> 
>>>>>>>>>>>> struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
>>>>>>>>>>>>                               unsigned int max_order, unsigned int *order)
>>>>>>>>>>>> 
>>>>>>>>>>>> the idea would then be to return a number of pages in the given
>>>>>>>>>>>> range.
>>>>>>>>>>>> 
>>>>>>>>>>>> What do you think? Want to try implementing that?
>>>>>>>>>>> 
>>>>>>>>>>> You can just start with the highest order and decrement the order until
>>>>>>>>>>> your allocation succeeds using alloc_pages(), which would be enough for
>>>>>>>>>>> a first version. At least I don't see the immediate need for a new
>>>>>>>>>>> kernel API.
>>>>>>>>>> 
>>>>>>>>>> OK I remember now. The problem is with reclaim. Unless reclaim is
>>>>>>>>>> completely disabled, any of these calls can sleep. After it wakes up,
>>>>>>>>>> we would like to get the larger order that has become available
>>>>>>>>>> meanwhile.
>>>>>>>>> 
>>>>>>>>> Yes, but that's a pure optimization IMHO.
>>>>>>>>> So I think we should do a trivial implementation first and then see what we gain from a new allocator API. Then we might also be able to justify it using real numbers.
>>>>>>>> 
>>>>>>>> Well how do you propose to implement the necessary semantics?
>>>>>>>> I think we are both agreed that alloc_page_range is more or
>>>>>>>> less what's necessary anyway - so how would you approximate it
>>>>>>>> on top of existing APIs?
>>>>>>> diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
>>>> 
>>>> .....
>>>> 
>>>> 
>>>>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>>>>>> index 26de020aae7b..067810b32813 100644
>>>>>>> --- a/mm/balloon_compaction.c
>>>>>>> +++ b/mm/balloon_compaction.c
>>>>>>> @@ -112,23 +112,35 @@ size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
>>>>>>>  EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>>>>>> 
>>>>>>>  /*
>>>>>>> - * balloon_page_alloc - allocates a new page for insertion into the balloon
>>>>>>> - * page list.
>>>>>>> + * balloon_pages_alloc - allocates a new page (of at most the given order)
>>>>>>> + * for insertion into the balloon page list.
>>>>>>>   *
>>>>>>>   * Driver must call this function to properly allocate a new balloon page.
>>>>>>>   * Driver must call balloon_page_enqueue before definitively removing the page
>>>>>>>   * from the guest system.
>>>>>>>   *
>>>>>>> + * Will fall back to smaller orders if allocation fails. The order of the
>>>>>>> + * allocated page is stored in page->private.
>>>>>>> + *
>>>>>>>   * Return: struct page for the allocated page or NULL on allocation failure.
>>>>>>>   */
>>>>>>> -struct page *balloon_page_alloc(void)
>>>>>>> +struct page *balloon_pages_alloc(int order)
>>>>>>>  {
>>>>>>> -        struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>>>>>> -                                       __GFP_NOMEMALLOC | __GFP_NORETRY |
>>>>>>> -                                       __GFP_NOWARN);
>>>>>>> -        return page;
>>>>>>> +        struct page *page;
>>>>>>> +
>>>>>>> +        while (order >= 0) {
>>>>>>> +                page = alloc_pages(balloon_mapping_gfp_mask() |
>>>>>>> +                                   __GFP_NOMEMALLOC | __GFP_NORETRY |
>>>>>>> +                                   __GFP_NOWARN, order);
>>>>>>> +                if (page) {
>>>>>>> +                        set_page_private(page, order);
>>>>>>> +                        return page;
>>>>>>> +                }
>>>>>>> +                order--;
>>>>>>> +        }
>>>>>>> +        return NULL;
>>>>>>>  }
>>>>>>> -EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>>>>>> +EXPORT_SYMBOL_GPL(balloon_pages_alloc);
>>>>>>> 
>>>>>>>  /*
>>>>>>>   * balloon_page_enqueue - inserts a new page into the balloon page list.
>>>>>> 
>>>>>> 
>>>>>> I think this will try to invoke direct reclaim from the first iteration
>>>>>> to free up the max order.
>>>>> 
>>>>> %__GFP_NORETRY: The VM implementation will try only very lightweight
>>>>> memory direct reclaim to get some memory under memory pressure (thus it
>>>>> can sleep). It will avoid disruptive actions like OOM killer.
>>>>> 
>>>>> Certainly good enough for a first version I would say, no?
>>>> 
>>>> Frankly how well that behaves would depend a lot on the workload.
>>>> Can regress just as well.
>>>> 
>>>> For the 1st version I'd prefer something that is the least disruptive,
>>>> and that IMHO means we only trigger reclaim at all in the same configuration
>>>> as now - when we can't satisfy the lowest order allocation.
>>> 
>>> Agreed.
>>> 
>>>> Anything else would be a huge amount of testing with all kind of
>>>> workloads.
>>> 
>>> So doing a "& ~__GFP_RECLAIM" in case order > 0? (as done in
>>> GFP_TRANSHUGE_LIGHT)
>> 
>> That will improve the situation when reclaim is not needed, but leave
>> the problem in place for when it's needed: if reclaim does trigger, we
>> can get a huge free page and immediately break it up.
>> 
>> So it's ok as a first step but it will make the second step harder as
>> we'll need to test with reclaim :).
> 
> I expect the whole "steal huge pages from your guest" to be problematic,
> as I already mentioned to Alex. This needs a performance evaluation.
> 
> This all smells like a lot of workload dependent fine-tuning. :)

AFAIK the hardware overheads of keeping huge pages in the guest and backing them with 4KB pages are non-negligible.
Did you take those into account?
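
For reference, a minimal sketch of the first-step variant the thread converges on: keep the order-decrementing fallback from the quoted patch, but mask out __GFP_RECLAIM for any order > 0 (mirroring what GFP_TRANSHUGE_LIGHT does), so that reclaim is only entered when even the order-0 allocation cannot be satisfied. This is an illustration of the idea, not the patch actually posted in this thread; the function name and overall structure simply follow the quoted code.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/balloon_compaction.h>

/*
 * Illustrative sketch (not the posted patch): opportunistically try higher
 * orders without triggering reclaim, and only allow reclaim for the final
 * order-0 attempt, matching the current single-page behaviour.
 */
struct page *balloon_pages_alloc(int order)
{
        const gfp_t gfp = balloon_mapping_gfp_mask() | __GFP_NOMEMALLOC |
                          __GFP_NORETRY | __GFP_NOWARN;
        struct page *page;

        while (order >= 0) {
                /* No direct reclaim or kswapd wakeup for order > 0. */
                gfp_t flags = order ? gfp & ~__GFP_RECLAIM : gfp;

                page = alloc_pages(flags, order);
                if (page) {
                        /* Record the achieved order for the caller. */
                        set_page_private(page, order);
                        return page;
                }
                order--;
        }
        return NULL;
}

A caller could then read the achieved order back with page_private(page) when inflating, and pass the same order to __free_pages() on the deflate or error path.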