On Wed, Jan 26, 2022 at 02:22:26PM -0500, Pasha Tatashin wrote:
> On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
> > > The problems with page->_refcount are hard to debug, because usually
> > > when they are detected, the damage has occurred a long time ago. Yet,
> > > the problems with invalid page refcount may be catastrophic and lead to
> > > memory corruptions.
> > >
> > > Reduce the scope of when the _refcount problems manifest themselves by
> > > adding checks for underflows and overflows into functions that modify
> > > _refcount.
> >
> > If you're chasing a bug like this, presumably you turn on page
> > tracepoints. So could we reduce the cost of this by putting the
> > VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
> > change the arguments to those functions to pass in old & new, but that
> > should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>
> This is not only about chasing a bug. This also about preventing
> memory corruption and information leaking that are caused by ref_count
> bugs from happening.
> Several months ago a memory corruption bug was discovered by accident:
> an engineer was studying a process core from a production system and
> noticed that some memory does not look like it belongs to the original
> process. We tried to manually reproduce that bug but failed. However,
> later analysis by our team, explained that the problem occured due to
> ref_count bug in Linux, and the bug itself was root caused and fixed
> (mentioned in the cover letter). This work would have prevented
> similar ref_count bugs from yielding to the memory corruption
> situation.

But the VM_BUG_ON_PAGE tells us next to nothing useful. To take
your first example [1] as the kind of thing you say this is going to
help fix:

1. Page p is allocated by thread a (refcount 1)
2. Thread b gets mistaken pointer to p
3. Thread b calls put_page(), __put_page(), page goes to memory
   allocator.
4. Thread c calls alloc_page(), also gets page p (refcount 1 again).
5. Thread a calls put_page(), __put_page()
6. Thread c calls put_page() and gets a VM_BUG_ON_PAGE.

How do we find thread b's involvement? I don't think we can even see
thread a's involvement in all of this! All we know is a backtrace
pointing to thread c, who is a completely innocent bystander. I think
you have to enable page tracepoints to have any shot at finding thread
b's involvement.

[1] https://lore.kernel.org/stable/20211122171825.1582436-1-gthelen@xxxxxxxxxx/
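
To make the six steps above concrete, here is a rough user-space sketch
of the sequence. It is only a toy model, not kernel code: struct page,
model_alloc_page(), model_put_page(), replay_scenario() and the trace
log are all hypothetical names invented for illustration, and the
"return -1" merely stands in for where an underflow check such as
VM_BUG_ON_PAGE would fire. The trace array plays the role that page
tracepoints would play, under the assumption that they record every
refcount modification together with the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for struct page; only the refcount is modelled. */
struct page {
	int refcount;
};

/* Hypothetical trace record, loosely analogous to a page_ref tracepoint. */
struct trace_entry {
	char thread;      /* which thread touched the refcount */
	const char *op;   /* "alloc" or "put" */
	int old_ref;
	int new_ref;
};

#define TRACE_MAX 16
static struct trace_entry trace_log[TRACE_MAX];
static int trace_len;

static void trace(char thread, const char *op, int old_ref, int new_ref)
{
	if (trace_len < TRACE_MAX)
		trace_log[trace_len++] =
			(struct trace_entry){ thread, op, old_ref, new_ref };
}

/* Model of alloc_page(): the allocator hands out the page with refcount 1. */
static void model_alloc_page(struct page *p, char thread)
{
	int old_ref = p->refcount;

	p->refcount = 1;
	trace(thread, "alloc", old_ref, 1);
}

/*
 * Model of put_page() with an underflow check.
 * Returns 0 on success, -1 where the check would fire.
 */
static int model_put_page(struct page *p, char thread)
{
	int old_ref = p->refcount;
	int new_ref = old_ref - 1;

	trace(thread, "put", old_ref, new_ref);
	if (new_ref < 0)
		return -1;	/* stands in for VM_BUG_ON_PAGE */
	p->refcount = new_ref;
	return 0;
}

/* Replays steps 1-6; returns the thread that trips the check, or 0. */
char replay_scenario(void)
{
	struct page p = { 0 };

	trace_len = 0;
	model_alloc_page(&p, 'a');		/* 1: a allocates p */
	assert(p.refcount == 1);
	/* 2: b gets a mistaken pointer to p; no refcount change */
	if (model_put_page(&p, 'b'))		/* 3: 1 -> 0, p is freed */
		return 'b';
	model_alloc_page(&p, 'c');		/* 4: c gets p, refcount 1 */
	if (model_put_page(&p, 'a'))		/* 5: 1 -> 0, freed under c */
		return 'a';
	if (model_put_page(&p, 'c'))		/* 6: 0 -> -1, check fires */
		return 'c';
	return 0;
}

/* Returns 1 if the trace log has any entry for the given thread. */
int trace_mentions(char thread)
{
	for (int i = 0; i < trace_len; i++)
		if (trace_log[i].thread == thread)
			return 1;
	return 0;
}

/* Prints the recorded history, one refcount change per line. */
void dump_trace(void)
{
	for (int i = 0; i < trace_len; i++)
		printf("thread %c: %s %d -> %d\n", trace_log[i].thread,
		       trace_log[i].op, trace_log[i].old_ref,
		       trace_log[i].new_ref);
}
```

In this model replay_scenario() returns 'c': the check trips in the
innocent thread, and the return value alone says nothing about a or b.
Only the recorded history (dump_trace(), or trace_mentions('b')) shows
that thread b dropped a reference it never held.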