On 04.01.21 11:03, Michal Hocko wrote: > Hi, > back in March [1] you have recommended 53cdc1cb29e8 > ("drivers/base/memory.c: indicate all memory blocks as removable") to be > backported to stable trees and that has led to a more general discussion > about the current state of pfn walkers wrt. uninitialized pmem struct > pages. We haven't concluded any specific solution for that except for a > general sentiment that pfn_to_online_page should be able to catch those. > I might have missed any follow ups on that but I do not think we have > landed on any actual solution in the end. Have I just missed any followups? Thanks for raising this issue. It's still on my list of "broken and non-trivial to fix" things. So, as far as I recall, we still have the following two issues remaining: 1. pfn_to_online_page() false positives The semantics of pfn_to_online_page() were broken with sub-section hot-add in corner cases. If we have ZONE_DEVICE hot-added memory that overlaps in a section with boot memory, this memory section will contain parts ZONE_DEVICE memory and parts !ZONE_DEVICE memory. This can happen in sub-section granularity (2MB IIRC). pfn_to_online_page() will succeed on ZONE_DEVICE memory parts as the whole section is marked as online. Bad. One instance where this is still an issue is mm/memory-failure.c:memory_failure() and mm/memory-failure.c:soft_offline_page(). I thought for a while about "fixing" these, but to me it felt like fixing pfn_to_online_page() is actually the right approach. But worse, before ZONE_DEVICE hot-add 1. The whole section memmap does already exist (early sections always have a full memmap for the whole section) 2. The whole section memmap is initialized (although eventually with dummy node/zone 0/0 for memory holes until that part is fixed) and might be accessed by pfn walkers. So when hotadding ZONE_DEVICE we are modifying already existing and visible memmaps. Bad. 2. Deferred init of ZONE_DEVICE ranges memmap_init_zone_device() runs after the ZONE_DEVICE zone was resized and outside the memhp lock. I did not follow if the use of get_dev_pagemap() actually makes sure that memmap_init_zone_device() in pagemap_range() actually completed. I don't think it does. > > Is anybody working on that? > I believe Dan mentioned somewhere that he wants to see a real instance of this producing a BUG before actually moving forward with a fix. I might be wrong. We might tackle 1. by: a) [worked-around] doing get_dev_pagemap() before pfn_to_online_page() - just plain ugly. b) [worked-around] using a sub-section online-map and extending pfn_to_online_page(). IMHO ugly to fix this corner case. c) [worked-around] extending pfn_to_online_page() by zone checks. IMHO racy. d) fixed by not allowing ZONE_DEVICE and !ZONE_DEVICE within a single section. In the worst case, don't add partially present sections that have big holes in the beginning/end. Like, if there is a 128MB section with 126MB of memory followed by a 2MB hole, don't add that memory to Linux with CONFIG_ZONE_DEVICE. Might turn some memory unusable, but well, it would be the price to pay for simplicity. Being able to hotadd PMEM is more important than using each and every last MB of memory. e) decrease the section size and drop sub-section hotadd support. As sub-section hotadd is 2MB and MAX_ORDER - 1 corresponds 4MB, this is mostly impossible. Worse on aarch64 with 64k base pages - 1024MB sections and MAX_ORDER - 1 corresponds 512MB. I think this is not feasible. We might tackle 2. by making sure that get_dev_pagemap() only succeeds after memmap_init_zone_device() finished. As I said, not sure if that is already done. > Also is there any reference explaining what those struct pages are and > why we cannot initialize them? I am sorry if this has been explained to > me but I really cannot find that in my notes anywhere. Most pmem pages > should be initialized via memmap_init_zone_device, right? Long story short, IMHO we have to a) fix pfn_to_online_page() to only succeed for !ZONE_DEVICE memory. b) fix get_dev_pagemap() to only succeed if memmap_init_zone_device() completed. (again, unless this already works as desired) Dealing with races (memory_offlining() racing with pfn_to_online_page()) is another story and stuff for the future. Another level of complexity :) > I am asking mostly because we are starting to see these issues in > production and while the only visible problem so far is a crash when > reading sysfs (removable file) I am worried we are just lucky no other > pfn walker stumble over this. Which exact issue are you seeing? The one tackled by "[PATCH v1] drivers/base/memory.c: indicate all memory blocks as removable" ? -- Thanks, David / dhildenb