On 27.07.22 16:15, Charan Teja Kalla wrote: > The below is one path where race between page_ext and offline of the > respective memory blocks will cause use-after-free on the access of > page_ext structure. > > process1 process2 > --------- --------- > a)doing /proc/page_owner doing memory offline > through offline_pages. > > b)PageBuddy check is failed > thus proceed to get the > page_owner information > through page_ext access. > page_ext = lookup_page_ext(page); > > migrate_pages(); > ................. > Since all pages are successfully > migrated as part of the offline > operation,send MEM_OFFLINE notification > where for page_ext it calls: > offline_page_ext()--> > __free_page_ext()--> > free_page_ext()--> > vfree(ms->page_ext) > mem_section->page_ext = NULL > > c) Check for the PAGE_EXT flags > in the page_ext->flags access > results into the use-after-free(leading > to the translation faults). > > As mentioned above, there is really no synchronization between page_ext > access and its freeing in the memory_offline. > > The memory offline steps(roughly) on a memory block is as below: > 1) Isolate all the pages > 2) while(1) > try free the pages to buddy.(->free_list[MIGRATE_ISOLATE]) > 3) delete the pages from this buddy list. > 4) Then free page_ext.(Note: The struct page is still alive as it is > freed only during hot remove of the memory which frees the memmap, which > steps the user might not perform). > > This design leads to the state where struct page is alive but the struct > page_ext is freed, where the later is ideally part of the former which > just representing the page_flags. > > The above mentioned race is just one example __but the problem persists > in the other paths too involving page_ext->flags access(eg: > page_is_idle())__. Since offline waits till the last reference on the > page goes down i.e. any path that took the refcount on the page can make > the memory offline operation to wait. Eg: In the migrate_pages() > operation, we do take the extra refcount on the pages that are under > migration and then we do copy page_owner by accessing page_ext. For > > Fix those paths where offline races with page_ext access by maintaining > synchronization with rcu lock and is achieved in 3 steps: > 1) Invalidate all the page_ext's of the sections of a memory block by > storing a flag in the LSB of mem_section->page_ext. > > 2) Wait till all the existing readers to finish working with the > ->page_ext's with synchronize_rcu(). Any parallel process that starts > after this call will not get page_ext, through lookup_page_ext(), for > the block parallel offline operation is being performed. > > 3) Now safely free all sections ->page_ext's of the block on which > offline operation is being performed. > > Thanks to David Hildenbrand for his views/suggestions on the initial > discussion[1] and Pavan kondeti for various inputs on this patch. > > FAQ's: > Q) Should page_ext_[get|put]() needs to be used for every page_ext > access? > A) NO, the synchronization is really not needed in all the paths of > accessing page_ext. One case is where extra refcount is taken on a > page for which memory block, this pages falls into, offline operation is > being performed. This extra refcount makes the offline operation not to > succeed hence the freeing of page_ext. Another case is where the page > is already being freed and we do reset its page_owner. > > Some examples where the rcu_lock is not taken while accessing the > page_ext are: > 1) In migration (where we also migrate the page_owner information), we > take the extra refcount on the source and destination pages and then > start the migration. This extra refcount makes the test_pages_isolated() > function to fail thus retry the offline operation. > > 2) In free_pages_prepare(), we do reset the page_owner(through page_ext) > which again doesn't need the protection to access because the page is > already freeing (through only one path). > > So, users need not to use page_ext_[get|put]() when they are sure that > extra refcount is taken on a page preventing the offline operation. > > Q) Why can't the page_ext is freed in the hot_remove path, where memmap > is also freed ? > > A) As per David's answers, there are many reasons and a few are: > 1) Discussions had happened in the past to eventually also use rcu > protection for handling pfn_to_online_page(). So doing it cleanly here > is certainly an improvement. > > 2) It's not good having to scatter section online checks all over the > place in page ext code. Once there is a difference between active vs. > stale page ext data things get a bit messy and error prone. This is > already ugly enough in our generic memmap handling code. > > 3) Having on-demand allocations, such as KASAN or page ext from the > memory online notifier is at least currently cleaner, because we don't > have to handle each and every subsystem that hooks into that during the > core memory hotadd/remove phase, which primarily only setups the > vmemmap, direct map and memory block devices. > > [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@xxxxxxxxxxx/ > I guess if we care about the synchronize_rcu() we could go crazy with temporary allocations for data-to-free + call_rcu(). -- Thanks, David / dhildenb