On 18.08.22 15:50, Charan Teja Kalla wrote:
> The below is one path where race between page_ext and offline of the
> respective memory blocks will cause use-after-free on the access of
> page_ext structure.
>
> process1                              process2
> ---------                             ---------
> a)doing /proc/page_owner              doing memory offline
>                                       through offline_pages.
>
> b)PageBuddy check is failed
> thus proceed to get the
> page_owner information
> through page_ext access.
> page_ext = lookup_page_ext(page);
>
>                                       migrate_pages();
>                                       .................
>                                       Since all pages are successfully
>                                       migrated as part of the offline
>                                       operation,send MEM_OFFLINE notification
>                                       where for page_ext it calls:
>                                       offline_page_ext()-->
>                                       __free_page_ext()-->
>                                          free_page_ext()-->
>                                            vfree(ms->page_ext)
>                                       mem_section->page_ext = NULL
>
> c) Check for the PAGE_EXT flags
> in the page_ext->flags access
> results into the use-after-free(leading
> to the translation faults).
>
> As mentioned above, there is really no synchronization between page_ext
> access and its freeing in the memory_offline.
>
> The memory offline steps(roughly) on a memory block is as below:
> 1) Isolate all the pages
> 2) while(1)
>   try free the pages to buddy.(->free_list[MIGRATE_ISOLATE])
> 3) delete the pages from this buddy list.
> 4) Then free page_ext.(Note: The struct page is still alive as it is
> freed only during hot remove of the memory which frees the memmap, which
> steps the user might not perform).
>
> This design leads to the state where struct page is alive but the struct
> page_ext is freed, where the later is ideally part of the former which
> just representing the page_flags (check [3] for why this design is
> chosen).
>
> The above mentioned race is just one example __but the problem persists
> in the other paths too involving page_ext->flags access(eg:
> page_is_idle())__.
>
> Fix all the paths where offline races with page_ext access by
> maintaining synchronization with rcu lock and is achieved in 3 steps:
> 1) Invalidate all the page_ext's of the sections of a memory block by
> storing a flag in the LSB of mem_section->page_ext.
>
> 2) Wait till all the existing readers to finish working with the
> ->page_ext's with synchronize_rcu(). Any parallel process that starts
> after this call will not get page_ext, through lookup_page_ext(), for
> the block parallel offline operation is being performed.
>
> 3) Now safely free all sections ->page_ext's of the block on which
> offline operation is being performed.
>
> Note: If synchronize_rcu() takes time then optimizations can be done in
> this path through call_rcu()[2].
>
> Thanks to David Hildenbrand for his views/suggestions on the initial
> discussion[1] and Pavan kondeti for various inputs on this patch.
>
> [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@xxxxxxxxxxx/
> [2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@xxxxxxxxxxx/T/#u
> [3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@xxxxxxxxxx/
>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> Signed-off-by: Charan Teja Kalla <quic_charante@xxxxxxxxxxx>

In general, LGTM, one comment below.

>
>  static ssize_t
> @@ -508,6 +527,14 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  	/* Find an allocated page */
>  	for (; pfn < max_pfn; pfn++) {
>  		/*
> +		 * This temporary page_owner is required so
> +		 * that we can avoid the context switches while holding
> +		 * the rcu lock and copying the page owner information to
> +		 * user through copy_to_user() or GFP_KERNEL allocations.
> +		 */
> +		struct page_owner page_owner_tmp;
> +
> +		/*
>  		 * If the new page is in a new MAX_ORDER_NR_PAGES area,
>  		 * validate the area as existing, skip it if not
>  		 */
> @@ -525,7 +552,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  			continue;
>  		}
>
> -		page_ext = lookup_page_ext(page);
> +		page_ext = page_ext_get(page);
>  		if (unlikely(!page_ext))
>  			continue;
>
> @@ -534,14 +561,14 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  		 * because we don't hold the zone lock.
>  		 */
>  		if (!test_bit(PAGE_EXT_OWNER, &page_ext->flags))
> -			continue;
> +			goto loop;
>
>  		/*
>  		 * Although we do have the info about past allocation of free
>  		 * pages, it's not relevant for current memory usage.
>  		 */
>  		if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags))
> -			continue;
> +			goto loop;
>
>  		page_owner = get_page_owner(page_ext);
>
> @@ -550,7 +577,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  		 * would inflate the stats.
>  		 */
>  		if (!IS_ALIGNED(pfn, 1 << page_owner->order))
> -			continue;
> +			goto loop;
>
>  		/*
>  		 * Access to page_ext->handle isn't synchronous so we should
> @@ -558,13 +585,17 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
>  		 */
>  		handle = READ_ONCE(page_owner->handle);
>  		if (!handle)
> -			continue;
> +			goto loop;
>
>  		/* Record the next PFN to read in the file offset */
>  		*ppos = (pfn - min_low_pfn) + 1;
>
> +		page_owner_tmp = *page_owner;
> +		page_ext_put(page_ext);
>  		return print_page_owner(buf, count, pfn, page,
> -				page_owner, handle);
> +				&page_owner_tmp, handle);
> +loop:
> +		page_ext_put(page_ext);
>  	}
>
>  	return 0;
> @@ -617,18 +648,20 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
>  			if (PageReserved(page))
>  				continue;
>
> -			page_ext = lookup_page_ext(page);
> +			page_ext = page_ext_get(page);
>  			if (unlikely(!page_ext))
>  				continue;
>
>  			/* Maybe overlapping zone */
>  			if (test_bit(PAGE_EXT_OWNER, &page_ext->flags))
> -				continue;
> +				goto loop;
>
>  			/* Found early allocated page */
>  			__set_page_owner_handle(page_ext, early_handle,
>  						0, 0);
>  			count++;
> +loop:
> +			page_ext_put(page_ext);
>  		}

I kind-of dislike the "loop" labels. Can we come up with a more
expressive name? "put_continue" or something?

One alternative would be to add to the beginning of the loop, and after
the loop sth like

	if (page_ext) {
		page_ext_put(page_ext);
		page_ext = NULL;
	}

One could wrap that in a function, but not sure if that improves the
situation.

-- 
Thanks,

David / dhildenb
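
To make the alternative mentioned above concrete (drop the reference at
the top of the loop and once more after the loop), here is a rough
userspace sketch. It is only an illustration under stated assumptions:
every demo_* name is hypothetical, the get/put pair is modelled as a
plain counter rather than the rcu read lock, and none of this is the
kernel code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page_ext. */
struct page_ext {
	unsigned long flags;
};

#define DEMO_NR_PFNS	8
#define DEMO_OWNER_BIT	1UL

static struct page_ext demo_exts[DEMO_NR_PFNS];
static int demo_get_count;	/* outstanding get() calls not yet put() */

/* Assumption: odd pfns model sections whose page_ext is unavailable. */
static struct page_ext *demo_page_ext_get(size_t pfn)
{
	if (pfn >= DEMO_NR_PFNS || (pfn % 2))
		return NULL;
	demo_get_count++;
	return &demo_exts[pfn];
}

static void demo_page_ext_put(struct page_ext *page_ext)
{
	assert(page_ext && demo_get_count > 0);
	demo_get_count--;
}

/* The wrapped-up form of the snippet: put if held, then clear. */
static void demo_put_and_clear(struct page_ext **page_ext)
{
	if (*page_ext) {
		demo_page_ext_put(*page_ext);
		*page_ext = NULL;
	}
}

/* Marks every reachable, not yet owned entry; returns how many. */
static size_t demo_scan(size_t nr_pfns)
{
	struct page_ext *page_ext = NULL;
	size_t pfn, count = 0;

	for (pfn = 0; pfn < nr_pfns; pfn++) {
		/* Release whatever the previous iteration still holds. */
		demo_put_and_clear(&page_ext);

		page_ext = demo_page_ext_get(pfn);
		if (!page_ext)
			continue;

		/* Plain continue: the put happens at the loop top. */
		if (page_ext->flags & DEMO_OWNER_BIT)
			continue;

		page_ext->flags |= DEMO_OWNER_BIT;
		count++;
	}

	/* Release the reference taken in the last iteration, if any. */
	demo_put_and_clear(&page_ext);
	return count;
}
```

With this shape the early exits stay plain "continue" statements and no
label is needed. The trade-off is that the reference is held slightly
longer (until the next iteration starts), and a path that returns from
inside the loop, like the one in read_page_owner(), would still need its
own put before returning.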