Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment

David Hildenbrand <david@xxxxxxxxxx> · Wed, 29 Jul 2020 10:44:20 +0200

On 29.07.20 10:27, Justin He wrote:
> Hi David
> 
>> -----Original Message-----
>> From: David Hildenbrand <david@xxxxxxxxxx>
>> Sent: Wednesday, July 29, 2020 2:37 PM
>> To: Justin He <Justin.He@xxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>; Vishal Verma
>> <vishal.l.verma@xxxxxxxxx>; Mike Rapoport <rppt@xxxxxxxxxxxxx>; David
>> Hildenbrand <david@xxxxxxxxxx>; Catalin Marinas <Catalin.Marinas@xxxxxxx>;
>> Will Deacon <will@xxxxxxxxxx>; Greg Kroah-Hartman
>> <gregkh@xxxxxxxxxxxxxxxxxxx>; Rafael J. Wysocki <rafael@xxxxxxxxxx>; Dave
>> Jiang <dave.jiang@xxxxxxxxx>; Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>;
>> Steve Capper <Steve.Capper@xxxxxxx>; Mark Rutland <Mark.Rutland@xxxxxxx>;
>> Logan Gunthorpe <logang@xxxxxxxxxxxx>; Anshuman Khandual
>> <Anshuman.Khandual@xxxxxxx>; Hsin-Yi Wang <hsinyi@xxxxxxxxxxxx>; Jason
>> Gunthorpe <jgg@xxxxxxxx>; Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>; Kees
>> Cook <keescook@xxxxxxxxxxxx>; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-
>> kernel@xxxxxxxxxxxxxxx; linux-nvdimm@xxxxxxxxxxxx; linux-mm@xxxxxxxxx; Wei
>> Yang <richardw.yang@xxxxxxxxxxxxxxx>; Pankaj Gupta
>> <pankaj.gupta.linux@xxxxxxxxx>; Ira Weiny <ira.weiny@xxxxxxxxx>; Kaly Xin
>> <Kaly.Xin@xxxxxxx>
>> Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem
>> alignment
>>
>>
>>
>>> On 29.07.2020 at 05:35, Jia He <justin.he@xxxxxxx> wrote:
>>>
>>> When enabling dax pmem as RAM device on arm64, I noticed that kmem_start
>>> addr in dev_dax_kmem_probe() should be aligned w/ SECTION_SIZE_BITS(30),
>>> i.e. 1G memblock size. Even Dan Williams' sub-section patch series [1]
>>> had been upstream merged, it was not helpful due to hard limitation of
>>> kmem_start:
>>> $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f -a 2M
>>> $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
>>> $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
>>> $cat /proc/iomem
>>> ...
>>> 23c000000-23fffffff : System RAM
>>>  23dd40000-23fecffff : reserved
>>>  23fed0000-23fffffff : reserved
>>> 240000000-33fdfffff : Persistent Memory
>>>  240000000-2403fffff : namespace0.0
>>>  280000000-2bfffffff : dax0.0          <- aligned with 1G boundary
>>>    280000000-2bfffffff : System RAM
>>> Hence there is a big gap between 0x2403fffff and 0x280000000 due to the
>>> 1G alignment.
>>>
>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can
>>> only use 2G bytes for dax pmem(kmem) in the worst case.
>>> e.g.
>>> 240000000-33fdfffff : Persistent Memory
>>> We can only use the memblock between [240000000, 2ffffffff] due to the
>>> hard limitation. It wastes too much memory space.
>>>
>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but
>>> there are too many concerns from other constraints, e.g. PAGE_SIZE,
>>> hugetlb, SPARSEMEM_VMEMMAP, page bits in struct page ...
>>>
>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem
>>> alignment with memory_block_size_bytes().
>>>
>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax
>>> pmem can be used as ram with smaller gap. Also the kmem hotplug
>>> add/remove are both tested on arm64/x86 guest.
>>>
>>
>> Hi,
>>
>> I am not convinced this use case is worth such hacks (that’s what it is)
>> for now. On real machines pmem is big - your example (losing 50%) is
>> extreme.
>>
>> I would much rather want to see the section size on arm64 reduced. I
>> remember there were patches and that at least with a base page size of 4k
>> it can be reduced drastically (64k base pages are more problematic due to
>> the ridiculous THP size of 512M). But it could be that a section size of
>> 512M is possible on all configs right now.
> 
> Yes, I once investigated thoroughly how to reduce the section size on arm64.
> There are many constraints for reducing SECTION_SIZE_BITS:
> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too
>    much.
> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted
>    into page->flags.

Yep.

> 3. MAX_ORDER depends on SECTION_SIZE_BITS 
>  - 3.1 mmzone.h
> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
> #error Allocator MAX_ORDER exceeds SECTION_SIZE
> #endif

Yep, with 4k base pages it's 4 MB. However, with 64k base pages it's
512 MB ( :( ).
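
To make those two numbers reproducible, here is a rough userspace sketch
(not kernel code) of the arithmetic behind the mmzone.h check quoted above.
The helper names are made up for illustration. The MAX_ORDER values in the
comments (11 for 4k, 14 for 64k base pages) are my assumption about the
arm64 defaults at the time; please double-check them against your config.

```c
#include <stdint.h>

/*
 * Number of address bits covered by the largest buddy allocation,
 * i.e. the left-hand side of the mmzone.h check:
 *     MAX_ORDER - 1 + PAGE_SHIFT
 * Assumed values: MAX_ORDER = 11 with 4k pages (PAGE_SHIFT = 12),
 * MAX_ORDER = 14 with 64k pages (PAGE_SHIFT = 16).
 */
static unsigned int max_alloc_bits(unsigned int max_order,
				   unsigned int page_shift)
{
	return max_order - 1 + page_shift;
}

/* Convert a size given as a bit count (2^bits bytes) to MiB. */
static uint64_t bits_to_mib(unsigned int bits)
{
	return (UINT64_C(1) << bits) >> 20;
}

/*
 * Mirrors the "#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS"
 * build check: returns nonzero if the section size is large enough.
 */
static int section_size_ok(unsigned int max_order, unsigned int page_shift,
			   unsigned int section_size_bits)
{
	return max_alloc_bits(max_order, page_shift) <= section_size_bits;
}
```

Under those assumptions, bits_to_mib(max_alloc_bits(11, 12)) gives 4 and
bits_to_mib(max_alloc_bits(14, 16)) gives 512, and section_size_ok(14, 16, 27)
fails, which is why 27 is not reachable with 64k base pages.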

>  - 3.2 hugepage_init()
> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER);
> 
> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled,
> SECTION_SIZE_BITS can be reduced to 27.
> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13.
> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not
> be reduced to 27.

I think there were plans to eventually switch to 2MB THP with 64k base
pages as well (which can be emulated using some sort of consecutive PTE
entries under arm64, don't ask me what this feature is called),
theoretically also allowing smaller section sizes (when also reducing
MAX_ORDER properly). I would highly appreciate that switch. Having the max
allocation/THP size in the range of gigantic pages sounds very weird to me
(and creates issues, e.g., when supporting hot(un)plug of small memory
blocks for virtio-mem). But I guess this is not under our control :)

> 
> In short, if we consider reducing SECTION_SIZE_BITS on arm64, the Kconfig
> might get very complicated, e.g. we still need to consider the case of
> ARM64_16K_PAGES.

Haven't looked into 16k base pages yet. But I remember it's in general
more similar to 4k than to 64k (speaking about sane THP sizes and
similar ...).
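
A quick sketch of why 16k sits closer to 4k, for what it's worth. This is
userspace C, not kernel code, with hypothetical helper names. It assumes
8-byte page table entries, so that one page-table page holds
2^(PAGE_SHIFT - 3) entries and a PMD-sized THP covers
2^(PAGE_SHIFT + PAGE_SHIFT - 3) bytes.

```c
#include <stdint.h>

/*
 * Assumption: 8-byte page table entries, so one table page maps
 * 2^(page_shift - 3) entries and a PMD entry covers
 * 2^(page_shift + page_shift - 3) bytes.
 */
static unsigned int pmd_shift(unsigned int page_shift)
{
	return page_shift + (page_shift - 3);
}

/* Order of a PMD-sized THP in base pages (what HPAGE_PMD_ORDER expresses). */
static unsigned int hpage_pmd_order(unsigned int page_shift)
{
	return pmd_shift(page_shift) - page_shift;
}

/* PMD-sized THP size in MiB for a given base page shift. */
static uint64_t thp_size_mib(unsigned int page_shift)
{
	return (UINT64_C(1) << pmd_shift(page_shift)) >> 20;
}
```

With that assumption, thp_size_mib() yields 2 for 4k (shift 12), 32 for 16k
(shift 14) and 512 for 64k (shift 16) base pages, and hpage_pmd_order(16)
is 13, matching the "29-16 = 13" in the quoted derivation.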

> 
>>
>> In the long term we might want to rework the memory block device model
>> (eventually supporting old/new as discussed with Michal some time ago
>> using a kernel parameter), dropping the fixed sizes
> 
> Has this been posted to the linux-mm mailing list? Sorry, I searched and didn't find it.

Yeah, but I might not be able to dig it out anymore ...

Anyhow, the idea would be to have some magic switch that converts
between old and new world, to not break userspace that relies on that.

With old, everything would continue to work as it is. With *new* we
would have a reduced number of memory blocks for boot memory and
decouple it from a strict, static memory block size.

There would be another option in corner cases right now. If you *know*
that the metadata memory has no memmap/identity mapping, and you have
1G alignment for your pmem device (including the metadata part), you could:

1. add_memory_device_managed() the whole memory, including the metadata part
2. use generic_online_pages() to not expose metadata pages to the buddy
3. Mark metadata pages in a special way, such that you can, e.g., allow to
offline memory again, including the metadata pages (e.g., PG_offline +
memory notifier like virtio-mem does)

Step 3 would only be relevant to support offlining of memory again.

If the metadata part is, however, already ZONE_DEVICE with a memmap,
then that's not an option. (I have no idea how that metadata part is
used, sorry)
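
For reference, the 1G alignment condition (and the gap in the quoted
/proc/iomem output) boils down to simple rounding. A rough userspace
sketch, not kernel code. The helper names and the SZ_1G define are only for
illustration, and the addresses in the note below are taken from the quoted
example.

```c
#include <stdint.h>

#define SZ_1G (UINT64_C(1) << 30)

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Round x down to a multiple of a (a must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/*
 * Bytes that remain usable when the half-open range [start, end) is
 * shrunk to 'align' boundaries on both sides; 0 if nothing remains.
 */
static uint64_t usable_size(uint64_t start, uint64_t end, uint64_t align)
{
	uint64_t s = align_up(start, align);
	uint64_t e = align_down(end, align);

	return e > s ? e - s : 0;
}
```

With the quoted layout, the data area starts at 0x240400000 (right after
the metadata), so align_up() moves it to 0x280000000 and 0x3fc00000 bytes
are skipped. For the 4G device ending at 0x33fe00000,
usable_size(0x240400000, 0x33fe00000, SZ_1G) leaves 2G, whereas a range
that starts 1G-aligned at 0x240000000 (metadata included) would leave 3G.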

-- 
Thanks,

David / dhildenb