Thanks Michal!!

On 7/18/2022 8:24 PM, Michal Hocko wrote:
>>>> The above mentioned race is just one example __but the problem persists
>>>> in the other paths too involving page_ext->flags access(eg:
>>>> page_is_idle())__. Since offline waits till the last reference on the
>>>> page goes down i.e. any path that took the refcount on the page can make
>>>> the memory offline operation to wait. Eg: In the migrate_pages()
>>>> operation, we do take the extra refcount on the pages that are under
>>>> migration and then we do copy page_owner by accessing page_ext. For
>>>>
>>>> Fix those paths where offline races with page_ext access by maintaining
>>>> synchronization with rcu lock.
>>> Please be much more specific about the synchronization. How does RCU
>>> actually synchronize the offlining and access? Higher level description
>>> of all the actors would be very helpful not only for the review but also
>>> for future readers.
>> I will improve the commit message about this synchronization change
>> using RCU's.
> Thanks! The most important part is how the exclusion is actually achieved
> because that is not really clear at first sight
>
> CPU1                                CPU2
> lookup_page_ext(PageA)              offlining
>                                       offline_page_ext
>                                         __free_page_ext(addrA)
>                                           get_entry(addrA)
>                                           ms->page_ext = NULL
>                                           synchronize_rcu()
>                                           free_page_ext
>                                             free_pages_exact (now addrA is unusable)
>
>   rcu_read_lock()
>   entryA = get_entry(addrA)
>     base + page_ext_size * index # an address not invalidated by the freeing path
>   do_something(entryA)
>   rcu_read_unlock()
>
> CPU1 never checks ms->page_ext so it cannot bail out early when the
> thing is torn down. Or maybe I am missing something. I am not familiar
> with page_ext much.

Thanks a lot for catching this, Michal. You are correct that the code I
proposed is still racy. I will correct this, along with a proper commit
message, in the next version of this patch.

>
>>> Also, more specifically
>>> [...]
>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index 3dc715d..5ccd3ee 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
>>>>          if (!ms || !ms->page_ext)
>>>>                  return;
>>>>          base = get_entry(ms->page_ext, pfn);
>>>> -        free_page_ext(base);
>>>>          ms->page_ext = NULL;
>>>> +        synchronize_rcu();
>>>> +        free_page_ext(base);
>>>>  }
>>> So you are imposing the RCU grace period for each page_ext! This can get
>>> really expensive. Have you tried to measure the effect?
> I was wrong here! This is for each memory section which is not as
> terrible as every single page_ext. This can be still quite a lot memory
> sections in a single memory block (e.g. on ppc memory sections are
> ridiculously small).
>

On ARM64, I see that the minimum section size is 128MB. I think 16MB is
the section size on ppc. Any input on how frequently offline/online
operations are done on ppc?

>> I didn't really measure the effect. Let me measure it and post these in V2.
> I think it would be much more optimal to split the operation into 2
> phases. Invalidate all the page_ext metadata then synchronize_rcu and
> only then free them all. I am not very familiar with page_ext so I am
> not sure this is easy to be done. Maybe page_ext = NULL can be done in
> the first stage.
>

Let me explore whether this can be done easily.

>>> 3) Change the design where the page_ext is valid as long as the struct
>>> page is alive.
>> :/ Doesn't spark joy."
> I would be wondering why. It should only take to move the callback to
> happen at hotremove. So it shouldn't be very involved of a change. I can
> imagine somebody would be relying on releasing resources when offlining
> memory but is that really the case?

I don't see any hard requirement for clients to release this page_ext
memory.
What I can think of is that the page_ext size is proportional to the
debug features we enable (which is what it is used for on 64-bit, as of
now). E.g., enabling page_owner requires an additional 0x30 bytes per
page, and that memory is not needed once the memory block is offlined.
But then the same is true of the memory occupied by struct page for the
offlined block.

One comment from the initial discussion:
"It smells like page_ext should use some mechanism during MEM_OFFLINE to
synchronize against any users of its metadata. Generic memory offlining
code might be the wrong place for that."

-- I think the page_ext creation and deletion should fit into the sparse
code. I will try to post the changes tomorrow, and if they do not fit
there, I will work on improving the current patch based on the RCU
logic.

Thanks,
Charan
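
For reference, here is a rough userspace model of the two points raised
in this thread: the reader re-checking the section's page_ext pointer
inside the read-side critical section, and a two-phase teardown that
pays for one grace period per offline operation instead of one per
section. This is only a sketch under assumptions, not the proposed
kernel change: every model_* name is hypothetical, the RCU calls are
empty stubs, and real kernel code would need rcu_assign_pointer() and
rcu_dereference() (or equivalent) on the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only: all model_* names are hypothetical. */
#define MODEL_PAGES_PER_SECTION 8
#define MODEL_MAX_SECTIONS 16

struct model_page_ext {
    unsigned long flags;
};

struct model_section {
    struct model_page_ext *page_ext; /* NULL once unpublished */
};

static int model_grace_periods; /* how many grace periods were paid for */

/* Stubs standing in for rcu_read_lock(), rcu_read_unlock(), synchronize_rcu(). */
static void model_rcu_read_lock(void) {}
static void model_rcu_read_unlock(void) {}
static void model_synchronize_rcu(void) { model_grace_periods++; }

/* Allocate zeroed page_ext storage for n sections; returns 0 on success. */
static int model_online_sections(struct model_section *secs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        secs[i].page_ext = NULL;
    for (size_t i = 0; i < n; i++) {
        secs[i].page_ext = calloc(MODEL_PAGES_PER_SECTION,
                                  sizeof(*secs[i].page_ext));
        if (!secs[i].page_ext) {
            while (i-- > 0) {
                free(secs[i].page_ext);
                secs[i].page_ext = NULL;
            }
            return -1;
        }
    }
    return 0;
}

/*
 * Reader: load the section pointer inside the read-side section and bail
 * out if it was already torn down. Returns 1 and fills *out on success.
 */
static int model_read_flags(struct model_section *ms, size_t index,
                            unsigned long *out)
{
    int ok = 0;

    model_rcu_read_lock();
    struct model_page_ext *base = ms->page_ext;
    if (base && index < MODEL_PAGES_PER_SECTION) {
        *out = base[index].flags;
        ok = 1;
    }
    model_rcu_read_unlock();
    return ok;
}

/* Two-phase teardown: unpublish everything, wait once, then free. */
static void model_offline_sections(struct model_section *secs, size_t n)
{
    struct model_page_ext *stale[MODEL_MAX_SECTIONS];

    assert(n <= MODEL_MAX_SECTIONS);

    /* Phase 1: new readers now see NULL and bail out. */
    for (size_t i = 0; i < n; i++) {
        stale[i] = secs[i].page_ext;
        secs[i].page_ext = NULL;
    }

    /* One grace period covers every section unpublished above. */
    model_synchronize_rcu();

    /* Phase 2: no reader can still hold the old pointers. */
    for (size_t i = 0; i < n; i++)
        free(stale[i]);
}
```

With four sections in the block, model_offline_sections() calls the
grace-period stub once, whereas the per-section variant in the quoted
diff would call it four times.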