>> Please note that we have other users that use PG_offline + refcount >= 1 >> (HyperV balloon, XEN). We should not affect these users (IOW, >> has_unmovable_pages() has to stop right there if we see one of these pages). > > OK, this is exactly what I was worried about. I can see why you might > want to go an easier way and rule those users out but wouldn't be it > actually more reasonable to explicitly request PageOffline users to > implement MEM_GOING_OFFLINE and prepare their offlined pages for the > offlining operation or fail right there if that is not possible. > If you fail right there during the isolation phase then there is no way > to allow the offlining to proceed from that context. I am not sure I agree. But let's discuss the details. See below. > >>>> 2) memory_notify(MEM_GOING_OFFLINE, &arg); >>>> -> Here, we could release all pages to the buddy, clearing PG_offline >>>> -> PF_offline must not be cleared so dumping tools will not touch >>>> these pages. There is a time where pages are !PageBuddy() and >>>> !PageOffline(). >>> >>> Well, this is fully under control of the driver, no? Reference count >>> shouldn't play any role here AFAIU. >> >> Yes, this is more a PG_offline issue. The reference count is an issue of >> reaching this call :) If we want to go via the buddy: >> >> 1. Clear PG_offline >> 2. Free page (gets set PG_buddy) >> >> Between 1 and 2, a dumping tool could not exclude these pages and therefore >> try to read from these pages. That is an issue IFF we want to return the >> pages back to the buddy instead of doing what I propose here. > > If the driver is going to free page to the allocator then it would have > to claim the page back and so it is usable again. If it cannot free it > then it would simply set the reference count to 0. It can even keep the > PG_offline if necessary although I have to admit I am not really sure it > is necessary. Yes it is necessary to keep PG_offline set to avoid anybody touching the page until the section is offline. (especially, dumping tools) But that's another discussion. The important part is to not go via the buddy. > >>>> 3) scan_movable_pages() ... >> >> Please note that when we don't put the pages back to the buddy and don't >> implement something like I have in this patch, we'll loop/fail here. >> Especially if we have pages with PG_offline + refcount >= 1 . > > You should have your reference count 0 at this stage as it is after > MEM_GOING_OFFLINE. > >>> MEM_CANCEL_OFFLINE could gain the reference back to balance the >>> MEM_GOING_OFFLINE step. >> >> The pages are already unisolated and could be used by the buddy. But again, >> I think you have an idea that tries to avoid putting pages to the buddy. > > Yeah, set_page_count(page, 0) if you do not want to release that page > from the notifier context to reflect that the page is ok to be offlined > with the rest. > I neither see how you deal with __test_page_isolated_in_pageblock() nor with __offline_isolated_pages(). Sorry, but what I read is incomplete and you probably have a full proposal in your head. Please read below how I think you want to solve it. > >>> explicit control via the reference count which is the standard way to >>> control the struct page life cycle. >>> >>> Anyway hooking into __put_page (which tends to be a hot path with >>> something that is barely used on most systems) doesn't sound nice to me. >>> This is the whole point which made me think about the whole reference >>> count approach in the first place. >> >> Again, the race I think that is possible >> >> somebody: get_page_unless_zero(page) >> virtio_mem: page_ref_dec(pfn_to_page(pfn) >> somebody: put_page() -> straight to the buddy > > Who is that somebody? I thought that it is only the owner/driver to have > a control over the page. Also the above is not possible as long as the > owner/driver keeps a reference to the PageOffline page throughout the > time it is marked that way. > I was reading include/linux/mm_types.h: "If you want to use the refcount field, it must be used in such a way that other CPUs temporarily incrementing and then decrementing the refcount does not cause problems" And that made me think "anybody can go ahead and try get_page_unless_zero()". If I am missing something here and this can indeed not happen (e.g., because PageOffline() pages are never mapped to user space), then I'll happily remove this code. > >>>>> If you can let the page go then just drop the reference count. The page >>>>> is isolated already by that time. If you cannot let it go for whatever >>>>> reason you can fail the offlining. >>>> >>>> We do have one hack in current MM code, which is the memory isolation >>>> notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages() >>>> to skip over unmovable pages. But quite frankly, I rather want to get rid of >>>> that crap (something I am working on right now) than introduce new users. >>>> This stuff is racy as hell and for CMM, if memory offlining fails, the >>>> ballooned pages are suddenly part of the buddy. Fragile. >>> >>> Could you be more specific please? >> >> Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it: >> >> cmm_memory_isolate_cb() -> cmm_count_pages(arg): >> - Memory Isolation notifier callback >> - Count how many pages in the range to be isolated are in the ballooon >> - This makes has_unmovable_pages() succeed. Pages can be isolated. >> >> cmm_memory_cb -> cmm_mem_going_offline(arg): >> - Memory notifier (online/offline) >> - Release all pages in the range to the buddy >> >> If offlining fails, the pages are now in the buddy, no longer in the >> balloon. MEM_CANCEL_ONLINE is too late, because the range is already >> unisolated again and the pages might be in use. >> >> For CMM it might not be that bad, because it can actually "reloan" any >> pages. In contrast, virtio-mem cannot simply go ahead and reuse random >> memory in unplugged. Any access to these pages would be evil. Giving them >> back to the buddy is dangerous. > > Thanks, I was not aware of that code. But from what I understood this is > an outright bug in this code because cmm_mem_going_offline releases > pages to the buddy allocator which is something that is not recoverable > on a later failure. > Yes, and that should be gone if we switch to balloon compaction. Let's recap what I suggest: "PageOffline() pages that have a reference count of 0 will be treated like free pages when offlining pages, allowing the containing memory block to get offlined. In case a driver wants to revive such a page, it has to synchronize against memory onlining/offlining (e.g., using memory notifiers) while incrementing the reference count. Also, a driver that relies in this feature is aware that re-onlining the memory will require to re-set the pages PageOffline() - e.g., via the online_page_callback_t." a) has_unmovable_pages() already skips over pages with a refcount of zero. The code I add to not skip over these pages when !MEMORY_OFFLINE is a pure optimization to fail early when trying to allocate from that range. b) __test_page_isolated_in_pageblock() is modified to skip over PageOffline() pages with a refcount of zero c) __offline_isolated_pages() is modified to skip over PageOffline() pages with a refcount of zero d) __put_page() is modified to not return pages to the buddy in any case as a safety net. We might be able to get rid of that. What I think you suggest: a) has_unmovable_pages() skips over all PageOffline() pages. This results in a lot of false negatives when trying to offline. Might be ok. b) The driver decrements the reference count of the PageOffline pages in MEM_GOING_OFFLINE. c) The driver increments the reference count of the PageOffline pages in MEM_CANCEL_OFFLINE. One issue might be that the pages are no longer isolated once we get that call. Might be ok. d) How to make __test_page_isolated_in_pageblock() succeed? Like I propose in this patch (PageOffline() + refcount == 0)? e) How to make __offline_isolated_pages() succeed? Like I propose in this patch (PageOffline() + refcount == 0)? In summary, is what you suggest simply delaying setting the reference count to 0 in MEM_GOING_OFFLINE instead of right away when the driver unpluggs the pages? What's the big benefit you see and I fail to see? -- Thanks, David / dhildenb