On 25.07.19 11:40, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 11:30:23AM +0200, David Hildenbrand wrote:
>> On 25.07.19 11:27, Oscar Salvador wrote:
>>> On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
>>>> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@xxxxxxx> wrote:
>>>>>
>>>>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
>>>>> and prepares the callers that add memory to take a "flags" parameter.
>>>>> This "flags" parameter will be evaluated later on in Patch#3
>>>>> to init mhp_restrictions struct.
>>>>>
>>>>> The callers are:
>>>>>
>>>>> add_memory
>>>>> __add_memory
>>>>> add_memory_resource
>>>>>
>>>>> Unfortunately, we do not have a single entry point to add memory, as depending
>>>>> on the requisites of the caller, they want to hook up in different places,
>>>>> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
>>>>> in the three callers.
>>>>>
>>>>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
>>>>> in the way they allocate vmemmap pages within the memory blocks.
>>>>>
>>>>> MHP_MEMMAP_MEMBLOCK:
>>>>>   - With this flag, we will allocate vmemmap pages in each memory block.
>>>>>     This means that if we hot-add a range that spans multiple memory blocks,
>>>>>     we will use the beginning of each memory block for the vmemmap pages.
>>>>>     This strategy is good for cases where the caller wants the flexibility
>>>>>     to hot-remove memory in a different granularity than when it was added.
>>>>>
>>>>>     E.g:
>>>>>       We allocate a range (x,y], that spans 3 memory blocks, and given
>>>>>       memory block size = 128MB.
>>>>>       [memblock#0          ]
>>>>>       [0 - 511 pfns        ] - vmemmaps for section#0
>>>>>       [512 - 32767 pfns    ] - normal memory
>>>>>
>>>>>       [memblock#1          ]
>>>>>       [32768 - 33279 pfns  ] - vmemmaps for section#1
>>>>>       [33280 - 65535 pfns  ] - normal memory
>>>>>
>>>>>       [memblock#2          ]
>>>>>       [65536 - 66047 pfns  ] - vmemmap for section#2
>>>>>       [66048 - 98303 pfns  ] - normal memory
>>>>>
>>>>> MHP_MEMMAP_DEVICE:
>>>>>   - With this flag, we will store all vmemmap pages at the beginning of
>>>>>     hot-added memory.
>>>>>
>>>>>     E.g:
>>>>>       We allocate a range (x,y], that spans 3 memory blocks, and given
>>>>>       memory block size = 128MB.
>>>>>       [memblock #0         ]
>>>>>       [0 - 1535 pfns       ] - vmemmap for section#{0-2}
>>>>>       [1536 - 98303 pfns   ] - normal memory
>>>>>
>>>>> When using larger memory blocks (1GB or 2GB), the principle is the same.
>>>>>
>>>>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to having a large contiguous
>>>>> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
>>>>> memory.
>>>>
>>>> Concept and patch look good to me, but I don't quite like the
>>>> proliferation of the _DEVICE naming, in theory it need not necessarily
>>>> be ZONE_DEVICE that is the only user of that flag. I also think it
>>>> might be useful to assign a flag for the default 'allocate from RAM'
>>>> case, just so the code is explicit. So, how about:
>>>
>>> Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
>>> MHP_MEMMAP_DEVICE was chosen to make a difference between:
>>>
>>> * allocate memmap pages for the whole memory-device
>>> * allocate memmap pages on each memoryblock that this memory-device spans
>>
>> I agree that DEVICE is misleading here, you are assuming a one-to-one
>> mapping between a device and add_memory(). You are actually talking
>> about "allocate a single chunk of memmap pages for the whole memory range
>> that is added - which could consist of multiple memory blocks".
>
> Well, I could not come up with a better name.
>
> MHP_MEMMAP_ALL?
> MHP_MEMMAP_WHOLE?

As I said somewhere already (as far as I recall), one mode would be
sufficient. If you want per memblock, add the memory in memblock
granularity. So having a MHP_MEMMAP_ON_MEMORY that allocates it in one
chunk would be sufficient for the current use cases (DIMMs, Hyper-V).

MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
chunk from the beginning of the added memory. This piece of memory will
be accessed and used even before the memory is onlined.

Of course, if we want to make it configurable (e.g., for ACPI) it would
be a different story. But for now this isn't really needed as far as I
can tell.

-- 
Thanks,

David / dhildenb
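For reference, here is a minimal, stand-alone sketch of the pfn arithmetic
behind the layouts discussed in the thread. It only redoes the math; it is
not kernel code. It assumes 128MB memory blocks, 4KB pages and a 64-byte
struct page, so the memmap of one 128MB section costs 32768 * 64 bytes =
2MB = 512 pages. The MHP_* names are simply those proposed above, and the
constants are assumptions, not the kernel's actual definitions.

/* layout.c - redo the vmemmap placement arithmetic from the thread */
#include <stdio.h>

#define MB(x)                 ((unsigned long)(x) << 20)
#define PAGE_SIZE             4096UL
#define MEMBLOCK_SIZE         MB(128)
#define PAGES_PER_BLOCK       (MEMBLOCK_SIZE / PAGE_SIZE)      /* 32768 */
#define STRUCT_PAGE_SIZE      64UL
/* pages needed to hold the memmap of one 128MB block/section: 512 */
#define VMEMMAP_PAGES         (PAGES_PER_BLOCK * STRUCT_PAGE_SIZE / PAGE_SIZE)

int main(void)
{
	unsigned long blocks = 3, b;

	/* MHP_MEMMAP_MEMBLOCK: vmemmap at the start of each memory block */
	printf("per-memblock layout (MHP_MEMMAP_MEMBLOCK):\n");
	for (b = 0; b < blocks; b++) {
		unsigned long base = b * PAGES_PER_BLOCK;

		printf("  memblock#%lu: pfns %lu-%lu vmemmap, %lu-%lu normal\n",
		       b, base, base + VMEMMAP_PAGES - 1,
		       base + VMEMMAP_PAGES, base + PAGES_PER_BLOCK - 1);
	}

	/* MHP_MEMMAP_DEVICE / MHP_MEMMAP_ON_MEMORY: one chunk up front */
	printf("single-chunk layout (MHP_MEMMAP_ON_MEMORY):\n");
	printf("  pfns 0-%lu vmemmap, %lu-%lu normal\n",
	       blocks * VMEMMAP_PAGES - 1, blocks * VMEMMAP_PAGES,
	       blocks * PAGES_PER_BLOCK - 1);

	return 0;
}

Built with a plain cc, it prints the same pfn ranges as the examples quoted
above and makes it easy to redo the math for 1GB or 2GB memory blocks.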