On 4/5/23 14:45, Mel Gorman wrote:
> On Thu, Feb 16, 2023 at 10:51:31AM +0100, Vlastimil Babka wrote:
>> Historically, we have performed sanity checks on all struct pages being
>> allocated or freed, making sure they have no unexpected page flags or
>> certain field values. This can detect insufficient cleanup and some
>> cases of use-after-free, although on its own it can't always identify
>> the culprit. The result is a warning and the "bad page" being leaked.
>>
>> The checks do need some cpu cycles, so in 4.7 with commits 479f854a207c
>> ("mm, page_alloc: defer debugging checks of pages allocated from the
>> PCP") and 4db7548ccbd9 ("mm, page_alloc: defer debugging checks of freed
>> pages until a PCP drain") they were no longer performed in the hot paths
>> when allocating and freeing from pcplists, but only when pcplists are
>> bypassed, refilled or drained. For debugging purposes, with
>> CONFIG_DEBUG_VM enabled the checks were instead still done in the
>> hot paths and not when refilling or draining pcplists.
>>
>> With 4462b32c9285 ("mm, page_alloc: more extensive free page checking
>> with debug_pagealloc"), enabling debug_pagealloc also moved the sanity
>> checks back to the hot paths. When both debug_pagealloc and
>> CONFIG_DEBUG_VM are enabled, the checks are done both in the hot paths
>> and on pcplist refill/drain.
>>
>> Even though the non-debug default today might seem to be a sensible
>> tradeoff between overhead and ability to detect bad pages, on closer
>> look it's arguably not. As most allocations go through the pcplists,
>> catching any bad pages when refilling or draining pcplists has only a
>> small chance, insufficient for debugging or serious hardening purposes.
>> On the other hand, the cost of the checks is concentrated in the already
>> expensive drain/refill batching operations, and those are done under the
>> often contended zone lock. That was recently identified as an issue for
>> page allocation, and the zone lock contention was reduced by moving the
>> checks outside of the locked section with the patch "mm: reduce lock
>> contention of pcp buffer refill", but the cost of the checks is still
>> visible compared to their removal [1]. In the pcplist draining path
>> free_pcppages_bulk() the checks are still done under zone->lock.
>>
>> Thus, remove the checks from the pcplist refill and drain paths
>> completely. Introduce a static key check_pages_enabled to control checks
>> during page allocation and freeing (whether the pcplist is used or
>> bypassed). The static key is enabled if any of the following is true:
>> - kernel is built with CONFIG_DEBUG_VM=y (debugging)
>> - debug_pagealloc or page poisoning is boot-time enabled (debugging)
>> - init_on_alloc or init_on_free is boot-time enabled (hardening)
>>
>> The resulting user visible changes:
>> - no checks when draining/refilling pcplists - less overhead, with
>>   likely no practical reduction of the ability to catch bad pages
>> - no checks when bypassing pcplists in the default config (no
>>   debugging/hardening) - less overhead etc. as above
>> - on typical hardened kernels [2], checks are now performed on each page
>>   allocation/free (previously only when bypassing/draining/refilling
>>   pcplists) - having init_on_alloc/init_on_free enabled should be
>>   sufficient indication for preferring more costly alloc/free operations
>>   for hardening purposes, and we shouldn't need to introduce another
>>   toggle
>> - code (various wrappers) removal and simplification
>>
>> [1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@xxxxxxxxxxxxxxxxxxx/
>> [2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@xxxxxxxxxxxxx/
>>
>> Reported-by: Alexander Halbuer <halbuer@xxxxxxxxxxxxxxxxxxx>
>> Reported-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
>
> Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Thanks.
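As a side note for readers less familiar with the mechanism, the static
key pattern described in the changelog boils down to roughly the sketch
below. This is illustrative only - the init helper and its name are made
up for the example, and the actual patch wires the enablement into the
existing mm debugging/hardening init code, so the exact definition and
guard conditions may differ:

/*
 * Sketch, not the literal patch: DEFINE_STATIC_KEY_MAYBE() makes the
 * key default-enabled when the kernel is built with CONFIG_DEBUG_VM=y.
 */
DEFINE_STATIC_KEY_MAYBE(CONFIG_DEBUG_VM, check_pages_enabled);

/* Hypothetical boot-time helper mirroring the conditions listed above. */
static void __init enable_page_checks_if_needed(void)
{
        /* GFP_KERNEL here is just a representative gfp for the query */
        if (debug_pagealloc_enabled() || page_poisoning_enabled() ||
            want_init_on_alloc(GFP_KERNEL) || want_init_on_free())
                static_branch_enable(&check_pages_enabled);
}

The alloc/free paths then test static_branch_unlikely(&check_pages_enabled),
which is patched to a no-op when the key is disabled, so the default
non-debug configuration pays essentially nothing for the checks.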
> Some minor comments below
>
>> @@ -1432,9 +1448,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
>>  	for (i = 1; i < (1 << order); i++) {
>>  		if (compound)
>>  			bad += free_tail_pages_check(page, page + i);
>
> free_tail_pages_check is also a function that only does something useful
> when CONFIG_DEBUG_VM is set. While it might be outside the scope of the
> patch, it might also benefit from check_pages_enabled checks?

True, will send a followup (rough sketch at the end of this mail). Will
also rename it to free_tail_page_prepare() as it in fact combines a
preparation component with an optional checks component. Will remove the
unlikely()s you pointed out as well.

>> -		if (unlikely(free_page_is_bad(page + i))) {
>> -			bad++;
>> -			continue;
>> +		if (static_branch_unlikely(&check_pages_enabled)) {
>> +			if (unlikely(free_page_is_bad(page + i))) {
>> +				bad++;
>> +				continue;
>> +			}
>>  		}
>>  		(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>>  	}
>
> The unlikely() within a static_branch_unlikely probably adds very little
> given the block is so tiny.
>
>> @@ -2392,56 +2369,20 @@ static inline int check_new_page(struct page *page)
>>  	return 1;
>>  }
>>
>> -static bool check_new_pages(struct page *page, unsigned int order)
>> +static inline bool check_new_pages(struct page *page, unsigned int order)
>>  {
>> -	int i;
>> -	for (i = 0; i < (1 << order); i++) {
>> -		struct page *p = page + i;
>> +	if (static_branch_unlikely(&check_pages_enabled)) {
>> +		for (int i = 0; i < (1 << order); i++) {
>> +			struct page *p = page + i;
>>
>> -		if (unlikely(check_new_page(p)))
>> -			return true;
>> +			if (unlikely(check_new_page(p)))
>> +				return true;
>> +		}
>>  	}
>>
>
> unlikely() within static_branch_unlikely probably adds very little.
>
> Otherwise, looks good. I agree that with changes over time the ability
> of the checks to detect anything is reduced and it's probably at the
> point where it can only detect a very specific bit corruption instead
> of broken code. Commit 44042b449872 ("mm/page_alloc: allow high-order
> pages to be stored on the per-cpu lists") also likely reduced the
> ability of the checks to find anything.
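For illustration, the followup mentioned above could take roughly the
following shape (hypothetical code, not the actual patch - the real
checks component validates several more tail page fields): the
preparation part runs unconditionally, the checks move behind the static
key, and the inner unlikely() is dropped per the comment above:

/*
 * Hypothetical sketch of free_tail_page_prepare(), not the real
 * followup: the prepare component always runs, while the checks
 * component only runs when check_pages_enabled is set.
 */
static int free_tail_page_prepare(struct page *head_page, struct page *page)
{
        int ret = 0;

        if (static_branch_unlikely(&check_pages_enabled)) {
                /* checks component (the real code validates more fields) */
                if (compound_head(page) != head_page) {
                        bad_page(page, "compound_head not consistent");
                        ret = 1;
                }
        }

        /* prepare component: sever the tail page from its compound page */
        page->mapping = NULL;
        clear_compound_head(page);

        return ret;
}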