On 2/8/19 12:58 PM, Alexander Duyck wrote:
> On Thu, Feb 7, 2019 at 12:50 PM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>>
>> On 2/7/19 12:43 PM, Alexander Duyck wrote:
>>> On Tue, Feb 5, 2019 at 3:21 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>> On Tue, Feb 05, 2019 at 04:54:03PM -0500, Nitesh Narayan Lal wrote:
>>>>> On 2/5/19 3:45 PM, Michael S. Tsirkin wrote:
>>>>>> On Mon, Feb 04, 2019 at 03:18:53PM -0500, Nitesh Narayan Lal wrote:
>>>>>>> This patch enables the kernel to scan the per-cpu array and
>>>>>>> compress it by removing the repetitive/re-allocated pages.
>>>>>>> Once the per-cpu array is completely filled with pages in the
>>>>>>> buddy, it wakes up the per-cpu kernel thread, which re-scans the
>>>>>>> entire per-cpu array by acquiring a zone lock corresponding to
>>>>>>> the page being scanned. If the page is still free and
>>>>>>> present in the buddy, it tries to isolate the page and adds it
>>>>>>> to another per-cpu array.
>>>>>>>
>>>>>>> Once this scanning process is complete, and if there are any
>>>>>>> isolated pages added to the new per-cpu array, the kernel thread
>>>>>>> invokes hyperlist_ready().
>>>>>>>
>>>>>>> In hyperlist_ready() a hypercall is made to report these pages to
>>>>>>> the host using the virtio-balloon framework. In order to do so,
>>>>>>> another virtqueue 'hinting_vq' is added to the balloon framework.
>>>>>>> As the host frees all the reported pages, the kernel thread returns
>>>>>>> them back to the buddy.
>>>>>>>
>>>>>>> Signed-off-by: Nitesh Narayan Lal <nitesh@xxxxxxxxxx>
>>>>>> This looks kind of like what early iterations of Wei's patches did.
>>>>>>
>>>>>> But this has lots of issues; for example, you might end up with
>>>>>> a hypercall per 4K page.
>>>>>> So in the end, he switched over to reporting only
>>>>>> MAX_ORDER - 1 pages.
>>>>> You mean that I should only capture/attempt to isolate pages with
>>>>> order MAX_ORDER - 1?
>>>>>> Would that be a good idea for you too?
>>>>> Will it help if we have a threshold value based on the amount of
>>>>> memory captured instead of the number of entries/pages in the array?
>>>> This is what Wei's patches do, at least.
>>> So in the solution I had posted, I was looking more at
>>> HUGETLB_PAGE_ORDER and above as the size of pages to provide the hints
>>> on [1]. The advantage of doing that is that you can also avoid
>>> fragmenting huge pages, which in turn can cause what looks like a
>>> memory leak as the memory subsystem attempts to reassemble huge
>>> pages [2]. In my mind a 2MB page makes good sense as the size of
>>> things to be performing hints on, as anything smaller than that is
>>> just going to be a bunch of extra work and cause a lot of
>>> fragmentation.
>> In my opinion, the page size to accumulate before reporting, in any
>> implementation, depends on the allocation pattern of the workload
>> running in the guest.
> I suggest you take a look at item 2 that I had called out in the
> previous email. There are known issues with providing hints smaller
> than THP using MADV_DONTNEED or MADV_FREE. Specifically, what will
> happen is that you end up breaking up a higher-order transparent huge
> page, backfilling a few holes with other pages, but then the memory
> allocation subsystem attempts to reassemble the larger THP page,
> resulting in an application exhibiting behavior similar to a memory
> leak while not actually allocating memory, since it is sitting on
> fragments of THP pages.

I will look into this.
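For reference, a minimal userspace sketch of the effect described above
(illustrative only, not taken from the posted patches; it uses only the
standard posix_memalign()/madvise() calls). Hinting a single 4K subpage
of a THP-backed region splits the 2MB huge page, which khugepaged must
later try to reassemble:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE	(2UL << 20)	/* 2MB, the x86 THP size */

int main(void)
{
	void *buf;

	/* 2MB-aligned region, eligible to be backed by a single THP */
	if (posix_memalign(&buf, THP_SIZE, THP_SIZE))
		return 1;
	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 0, THP_SIZE);	/* fault the huge page in */

	/*
	 * A hint smaller than THP: this forces the kernel to split
	 * the huge page into base pages, leaving the fragments that
	 * the memory subsystem later attempts to recollapse.
	 */
	madvise(buf, 4096, MADV_DONTNEED);
	return 0;
}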
>
> Also, while I am thinking of it, I haven't noticed anywhere that you
> are handling the case of a device assigned to the guest. That seems
> like a spot where we are going to have to stop hinting as well, aren't
> we? Otherwise we would need to redo the memory mapping of the guest in
> the IOMMU every time a page is evicted and replaced.

I haven't explored such a use case as of now but will definitely
explore it.

>
>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1.
>> However, I am still thinking about a workload which I can use to test
>> its effectiveness.
> You might want to look at doing something like min(MAX_ORDER - 1,
> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for
> THP, which is the most likely to be used page size with the guest.

Sure, thanks for the suggestion. (A rough sketch of such a cutoff
follows below my signature.)

>
>>> The only issue with limiting things on an arbitrary boundary like
>>> that is that you have to hook into the buddy allocator to catch the
>>> cases where a page has been merged up into that range.
>> I don't think I understood your comment completely. In any case, we
>> have to rely on the buddy for merging the pages.
>>> [1] https://lkml.org/lkml/2019/2/4/903
>>> [2] https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/
>> --
>> Regards
>> Nitesh
>>
--
Regards
Nitesh
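P.S. For illustration only, a minimal sketch of the suggested order
cutoff. The names HINTING_MIN_ORDER and page_worth_hinting() are
hypothetical, not from the posted patches:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>

/* Hypothetical cutoff: only hint on pages of at least this order */
#define HINTING_MIN_ORDER	min_t(unsigned int, MAX_ORDER - 1, \
				      HUGETLB_PAGE_ORDER)

static inline bool page_worth_hinting(unsigned int order)
{
	/* On x86 this resolves to HUGETLB_PAGE_ORDER, i.e. 2MB THPs */
	return order >= HINTING_MIN_ORDER;
}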