Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

David Hildenbrand <david@xxxxxxxxxx> · Tue, 31 Mar 2020 15:32:05 +0200

On 31.03.20 15:24, Michael S. Tsirkin wrote:
> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
>>> On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>>>>
>>>>
>>>>> Am 26.03.2020 um 08:21 schrieb Michael S. Tsirkin <mst@xxxxxxxxxx>:
>>>>>
>>>>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
>>>>>>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
>>>>>>>> 2. You are essentially stealing THPs in the guest. So the fastest
>>>>>>>> mapping (THP in guest and host) is gone. The guest won't be able to make
>>>>>>>> use of THP where it previously was able to. I can imagine this implies a
>>>>>>>> performance degradation for some workloads. This needs a proper
>>>>>>>> performance evaluation.
>>>>>>>
>>>>>>> I think the problem is more with the alloc_pages API.
>>>>>>> That gives you exactly the given order, and if there's
>>>>>>> a larger chunk available, it will split it up.
>>>>>>>
>>>>>>> But for balloon - I suspect lots of other users,
>>>>>>> we do not want to stress the system but if a large
>>>>>>> chunk is available anyway, then we could handle
>>>>>>> that more optimally by getting it all in one go.
>>>>>>>
>>>>>>>
>>>>>>> So if we want to address this, IMHO this calls for a new API.
>>>>>>> Along the lines of
>>>>>>>
>>>>>>>    struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
>>>>>>>                    unsigned int max_order, unsigned int *order)
>>>>>>>
>>>>>>> the idea would then be to return at a number of pages in the given
>>>>>>> range.
>>>>>>>
>>>>>>> What do you think? Want to try implementing that?
>>>>>>
>>>>>> You can just start with the highest order and decrement the order until
>>>>>> your allocation succeeds using alloc_pages(), which would be enough for
>>>>>> a first version. At least I don't see the immediate need for a new
>>>>>> kernel API.
>>>>>
>>>>> OK I remember now.  The problem is with reclaim. Unless reclaim is
>>>>> completely disabled, any of these calls can sleep. After it wakes up,
>>>>> we would like to get the larger order that has become available
>>>>> meanwhile.
>>>>>
>>>>
>>>> Yes, but that‘s a pure optimization IMHO.
>>>> So I think we should do a trivial implementation first and then see what we gain from a new allocator API. Then we might also be able to justify it using real numbers.
>>>>
>>>
>>> Well how do you propose implement the necessary semantics?
>>> I think we are both agreed that alloc_page_range is more or
>>> less what's necessary anyway - so how would you approximate it
>>> on top of existing APIs?
>>
>> Looking at drivers/misc/vmw_balloon.c:vmballoon_inflate(), it first
>> tries to allocate huge pages using
>>
>> 	alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN| __GFP_NOMEMALLOC, 
>>                     VMW_BALLOON_2M_ORDER)
>>
>> And then falls back to 4k allocations (balloon_page_alloc()) in case
>> allocation fails.
>>
>> I'm roughly thinking of something like the following, but with an
>> optimized reporting interface/bigger pfn array so we can report >
>> 1MB at a time. Also, it might make sense to remember the order that
>> succeeded across some fill_balloon() calls.
>>
>> Don't even expect it to compile ...
>>
>>
>>
>> >From 4305f989672ccca4be9293e6d4167e929f3e299b Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@xxxxxxxxxx>
>> Date: Tue, 31 Mar 2020 12:28:07 +0200
>> Subject: [PATCH RFC] tmp
>>
>> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
>> ---
>>  drivers/virtio/virtio_balloon.c    | 38 ++++++++++++++++++--------
>>  include/linux/balloon_compaction.h |  7 ++++-
>>  mm/balloon_compaction.c            | 43 +++++++++++++++++++++++-------
>>  3 files changed, 67 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
>> index 8511d258dbb4..0660b1b988f0 100644
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -187,7 +187,7 @@ int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info,
>>  }
>>  
>>  static void set_page_pfns(struct virtio_balloon *vb,
>> -			  __virtio32 pfns[], struct page *page)
>> +			  __virtio32 pfns[], struct page *page, int order)
>>  {
>>  	unsigned int i;
>>  
>> @@ -197,7 +197,7 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>  	 * Set balloon pfns pointing at this page.
>>  	 * Note that the first pfn points at start of the page.
>>  	 */
>> -	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>> +	for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order); i++)
>>  		pfns[i] = cpu_to_virtio32(vb->vdev,
>>  					  page_to_balloon_pfn(page) + i);
>>  }
>> @@ -205,6 +205,7 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>  {
>>  	unsigned num_allocated_pages;
>> +	int order = MAX_ORDER - 1;
>>  	unsigned num_pfns;
>>  	struct page *page;
>>  	LIST_HEAD(pages);
>> @@ -212,9 +213,20 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>  	/* We can only do one array worth at a time. */
>>  	num = min(num, ARRAY_SIZE(vb->pfns));
>>  
>> +	/*
>> +	 * Note: we will currently never allocate more than 1MB due to the
>> +	 * pfn array size, so we will not allocate MAX_ORDER - 1 ...
>> +	 */
>> +
>>  	for (num_pfns = 0; num_pfns < num;
>> -	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
>> -		struct page *page = balloon_page_alloc();
>> +	     num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order)) {
>> +		const unsigned long remaining = num - num_pfns;
>> +
>> +		order = MIN(order,
>> +			    get_order(remaining << VIRTIO_BALLOON_PFN_SHIFT));
>> +		if ((1 << order) * VIRTIO_BALLOON_PAGES_PER_PAGE > remaining)
>> +			order--;
>> +		page = balloon_pages_alloc(order);
>>  
>>  		if (!page) {
>>  			dev_info_ratelimited(&vb->vdev->dev,
>> @@ -225,6 +237,8 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>  			break;
>>  		}
>>  
>> +		/* Continue with the actual order that succeeded. */
>> +		order = page_private(page);
>>  		balloon_page_push(&pages, page);
>>  	}
>>  
>> @@ -233,14 +247,16 @@ static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
>>  	vb->num_pfns = 0;
>>  
>>  	while ((page = balloon_page_pop(&pages))) {
>> +		order = page_order(page);
>> +		/* enqueuing will split the page and clear the order */
>>  		balloon_page_enqueue(&vb->vb_dev_info, page);
>>  
>> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>> -		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
>> +		set_page_pfns(vb, vb->pfns + vb->num_pfns, page, order);
>> +		vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order);
>>  		if (!virtio_has_feature(vb->vdev,
>>  					VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>> -			adjust_managed_page_count(page, -1);
>> -		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
>> +			adjust_managed_page_count(page, -1 * (1 << order));
>> +		vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order);
>>  	}
>>  
>>  	num_allocated_pages = vb->num_pfns;
>> @@ -284,7 +300,7 @@ static unsigned leak_balloon(struct virtio_balloon *vb, size_t num)
>>  		page = balloon_page_dequeue(vb_dev_info);
>>  		if (!page)
>>  			break;
>> -		set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
>> +		set_page_pfns(vb, vb->pfns + vb->num_pfns, page, 0);
>>  		list_add(&page->lru, &pages);
>>  		vb->num_pages -= VIRTIO_BALLOON_PAGES_PER_PAGE;
>>  	}
>> @@ -786,7 +802,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>>  	__count_vm_event(BALLOON_MIGRATE);
>>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
>>  	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
>> -	set_page_pfns(vb, vb->pfns, newpage);
>> +	set_page_pfns(vb, vb->pfns, newpage, 0);
>>  	tell_host(vb, vb->inflate_vq);
>>  
>>  	/* balloon's page migration 2nd step -- deflate "page" */
>> @@ -794,7 +810,7 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
>>  	balloon_page_delete(page);
>>  	spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
>>  	vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
>> -	set_page_pfns(vb, vb->pfns, page);
>> +	set_page_pfns(vb, vb->pfns, page, 0);
>>  	tell_host(vb, vb->deflate_vq);
>>  
>>  	mutex_unlock(&vb->balloon_lock);
>> diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
>> index 338aa27e4773..ed93fe5704d1 100644
>> --- a/include/linux/balloon_compaction.h
>> +++ b/include/linux/balloon_compaction.h
>> @@ -60,7 +60,7 @@ struct balloon_dev_info {
>>  	struct inode *inode;
>>  };
>>  
>> -extern struct page *balloon_page_alloc(void);
>> +extern struct page *balloon_pages_alloc(int order);
>>  extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
>>  				 struct page *page);
>>  extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
>> @@ -78,6 +78,11 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
>>  	balloon->inode = NULL;
>>  }
>>  
>> +static inline struct page *balloon_page_alloc(void)
>> +{
>> +	return balloon_pages_alloc(0);
>> +}
>> +
>>  #ifdef CONFIG_BALLOON_COMPACTION
>>  extern const struct address_space_operations balloon_aops;
>>  extern bool balloon_page_isolate(struct page *page,
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index 26de020aae7b..067810b32813 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -112,23 +112,35 @@ size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
>>  EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>  
>>  /*
>> - * balloon_page_alloc - allocates a new page for insertion into the balloon
>> - *			page list.
>> + * balloon_pages_alloc - allocates a new page (of at most the given order)
>> + * 			 for insertion into the balloon page list.
>>   *
>>   * Driver must call this function to properly allocate a new balloon page.
>>   * Driver must call balloon_page_enqueue before definitively removing the page
>>   * from the guest system.
>>   *
>> + * Will fall back to smaller orders if allocation fails. The order of the
>> + * allocated page is stored in page->private.
>> + *
>>   * Return: struct page for the allocated page or NULL on allocation failure.
>>   */
>> -struct page *balloon_page_alloc(void)
>> +struct page *balloon_pages_alloc(int order)
>>  {
>> -	struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>> -				       __GFP_NOMEMALLOC | __GFP_NORETRY |
>> -				       __GFP_NOWARN);
>> -	return page;
>> +	struct page *page;
>> +
>> +	while (order >= 0) {
>> +		page = alloc_pages(balloon_mapping_gfp_mask() |
>> +				   __GFP_NOMEMALLOC | __GFP_NORETRY |
>> +				   __GFP_NOWARN, order);
>> +		if (page) {
>> +			set_page_private(page, order);
>> +			return page;
>> +		}
>> +		order--;
>> +	}
>> +	return NULL;
>>  }
>> -EXPORT_SYMBOL_GPL(balloon_page_alloc);
>> +EXPORT_SYMBOL_GPL(balloon_pages_alloc);
>>  
>>  /*
>>   * balloon_page_enqueue - inserts a new page into the balloon page list.
> 
> 
> I think this will try to invoke direct reclaim from the first iteration
> to free up the max order.

%__GFP_NORETRY: The VM implementation will try only very lightweight
memory direct reclaim to get some memory under memory pressure (thus it
can sleep). It will avoid disruptive actions like OOM killer.

Certainly good enough for a first version I would say, no? Looking at
the vmware balloon, they don't even set __GFP_NORETRY.

-- 
Thanks,

David / dhildenb

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization