On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> The problems with page->_refcount are hard to debug, because usually
> when they are detected, the damage has long since occurred. Yet,
> problems with an invalid page refcount can be catastrophic and lead to
> memory corruption. Reduce the scope of when the _refcount problems
> manifest themselves by adding checks for underflows and overflows into
> the functions that modify _refcount.
If you're chasing a bug like this, presumably you turn on page tracepoints. So could we reduce the cost of this by putting the VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to change the arguments to those functions to pass in old & new, but that should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>  static inline void page_ref_add(struct page *page, int nr)
>  {
> -	atomic_add(nr, &page->_refcount);
> +	int old_val = atomic_fetch_add(nr, &page->_refcount);
> +	int new_val = old_val + nr;
> +
> +	VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
>  	if (page_ref_tracepoint_active(page_ref_mod))
>  		__page_ref_mod(page, nr);
>  }
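Roughly something like this, perhaps (just a sketch of the idea; today
__page_ref_mod() takes only the delta, so the three-argument signature and
the adjusted trace call below are the assumed change, not current code):

static inline void page_ref_add(struct page *page, int nr)
{
	int old_val = atomic_fetch_add(nr, &page->_refcount);

	/* No VM_BUG_ON_PAGE in the fast path; just the fetch-add. */
	if (page_ref_tracepoint_active(page_ref_mod))
		__page_ref_mod(page, old_val, old_val + nr);
}

/* Out of line (mm/debug_page_ref.c); only reached with the tracepoint on. */
void __page_ref_mod(struct page *page, int old_val, int new_val)
{
	VM_BUG_ON_PAGE((unsigned int)new_val < (unsigned int)old_val, page);
	trace_page_ref_mod(page, new_val - old_val);
}

That way the check costs nothing unless the tracepoint is enabled, which is
presumably the case when you're actually hunting one of these bugs.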