Re: Question: Using online_pages/offline_pages() with granularity < mem section size

David Hildenbrand <david@xxxxxxxxxx> · Wed, 4 Apr 2018 11:08:03 +0200

On 03.03.2018 18:53, Dan Williams wrote:
> On Fri, Mar 2, 2018 at 7:23 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>> Hi,
>>
>> in the context of virtualization, I am experimenting right now with an
>> approach to plug/unplug memory using a paravirtualized interface(not
>> ACPI). And I stumbled over certain things, looking at the memory hot/un
>> plug code.
>>
>> The big picture:
>>
>> A paravirtualized device provides a physical memory region to the guest.
>> We could have multiple such devices. Each device is assigned to a NUMA
>> node. We want to control how much memory in such a region the guest is
>> allowed to use. We can dynamically add/remove memory to NUMA nodes this
>> way and make sure a guest cannot make use of more memory than requested.
>>
>> Especially: We decide in the kernel which memory block to online/offline.
>>
>>
>> The basic mechanism:
>>
>> The hypervisor provides a physical memory region to the guest. This
>> memory region can be used by the guest to plug/unplug memory. The
>> hypervisor asks for a certain amount of used memory and the guest should
>> try to reach that goal, by plugging/unplugging memory. Whenever the
>> guest wants to plug/unplug a block, it has to communicate that to the
>> hypervisor.
>>
>> The hypervisor can grant/deny requests to plug/unplug a block of memory.
>> Especially, the guest must not take more memory than requested. Trying
>> to read unplugged memory succeeds (e.g. for kdump), writing to that
>> memory is prohibited.
>>
>> Memory blocks can be of any granularity, but 1-4MB looks like a sane
>> amount to not fragment memory too much. If the guest can't find free
>> memory blocks, no unplug is possible.
>>
>>
>> In the guest, I add_memory() new memory blocks to the NORMAL zone. The
>> NORMAL zone makes it harder to remove memory but we don't run into any
>> problems (e.g. too little NORMAL memory e.g. for page tables). Now,
>> these chunks are fairly big (>= 128MB) and there seems to be no way to
>> plug/unplug smaller chunks to Linux using official interfaces ("memory
>> segments"). Trying to remove >=128MB of NORMAL memory will usually not
>> succeed. So I thought about manually removing parts of a memory section.
>>
>> Yes, this sounds similar to a balloon, but it is different: I have to
>> offline memory in a certain memory range, not just any memory in the
>> system. So I cannot simply use kmalloc() - there is no allocator that
>> guarantees that.
>>
>> So instead I want ahead and thought about simply manually
>> offlining/onlining parts of a memory segment - especially "page blocks".
>> I do my own bookkeeping about which parts of a memory segment are
>> online/offline and use that information for finding blocks to
>> plug/unplug. The offline_pages() interface made me assume that this
>> should work with blocks in the size of pageblock_nr_pages.
>>
>>
>> I stumbled over the following two problems:
>>
>> 1. __offline_isolated_pages() doesn't care about page blocks, it simply
>> calls offline_mem_sections(), which marks the whole section as offline,
>> although it has to remain online until all pages in that section were
>> offlined. Now this can be handled by moving the offline_mem_sections()
>> logic further outside to the caller of offline_pages().
>>
>> 2. While offlining 2MB blocks (page block size), I discovered that more
>> memory was marked as reserved. Especially, a page block contains pages
>> with an order 10 (4MB), which implies that two page blocks are "bound
>> together". This is also done in __offline_isolated_pages(). Offlining
>> 2MB will result in 4MB being marked as reserved.
>>
>> Now, when I switch to 4MB, my manual online_pages/offline_pages seems so
>> far to work fine.
>>
>> So my questions are:
>>
>> Can I assume that online_pages/offline_pages() works with "MAX_ORDER -
>> 1" sizes reliably? Should the checks in these functions be updated? page
>> blocks does not seem to be the real deal.
>>
>> Any better approach to allocate memory in a specific memory range
>> (without fake numa nodes)? So I could avoid using
>> online_pages/offline_pages and instead do it similar to a balloon
>> driver? (mark the page as reserved myself)
> 
> Not sure this answers your questions, but I did play with sub-section
> memory hotplug last year in this patch set, but it fell to the bottom
> of my queue. At least at the time it seemed possible to remove the
> section alignment constraints of memory hotplug.
> 
> https://lists.01.org/pipermail/linux-nvdimm/2017-March/009167.html
> 

Thanks, goes into a similar direction but seems to be more about "being
able to add a persistent memory device with bad alignment". The
!persistent memory part seems to be more complicated (e.g. struct pages
are allocated per segment).

In the meantime, I managed to make online_pages()/offline_pages() work
reliably with 4MB chunks. So I can e.g. add_memory() 128MB but only
online/offline 4MB chunks of that, which is sufficient for what I need
right now.

Will send some patches soon. Thanks!

-- 

Thanks,

David / dhildenb