On 1/26/22 20:22, Pasha Tatashin wrote:
> On Wed, Jan 26, 2022 at 1:59 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Jan 26, 2022 at 06:34:21PM +0000, Pasha Tatashin wrote:
>> > The problems with page->_refcount are hard to debug, because usually
>> > when they are detected, the damage has occurred a long time ago. Yet,
>> > the problems with invalid page refcount may be catastrophic and lead to
>> > memory corruptions.
>> >
>> > Reduce the scope of when the _refcount problems manifest themselves by
>> > adding checks for underflows and overflows into functions that modify
>> > _refcount.
>>
>> If you're chasing a bug like this, presumably you turn on page
>> tracepoints. So could we reduce the cost of this by putting the
>> VM_BUG_ON_PAGE parts into __page_ref_mod() et al? Yes, we'd need to
>> change the arguments to those functions to pass in old & new, but that
>> should be a cheap change compared to embedding the VM_BUG_ON_PAGE.
>
> This is not only about chasing a bug. This is also about preventing
> memory corruption and information leaking that are caused by ref_count
> bugs from happening.

So you mean it as a security hardening feature, not just a debugging aid?
To me it's dubious to put security hardening under CONFIG_DEBUG_VM. I think
it's just Fedora that uses DEBUG_VM in general production kernels?

> Several months ago a memory corruption bug was discovered by accident:
> an engineer was studying a process core from a production system and
> noticed that some memory does not look like it belongs to the original
> process. We tried to manually reproduce that bug but failed. However,
> later analysis by our team, explained that the problem occurred due to
> ref_count bug in Linux, and the bug itself was root caused and fixed
> (mentioned in the cover letter). This work would have prevented
> similar ref_count bugs from yielding to the memory corruption
> situation.
>
> Pasha
>
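
To make the idea under discussion concrete (checking for underflow and overflow in the functions that modify _refcount), here is a minimal userspace sketch using C11 atomics. It is not the code from the patch series: struct page_model, ref_add_ok(), ref_sub_ok(), page_model_ref_add() and page_model_ref_sub() are hypothetical names invented for illustration, and the helpers return false where a kernel implementation would presumably trip VM_BUG_ON_PAGE().

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the counter is modelled. */
struct page_model {
	atomic_int refcount;
};

/*
 * An increment by nr (assumed > 0) looks sane if the old count was not
 * already negative and the new count does not wrap past INT_MAX.
 */
static bool ref_add_ok(int old, int nr)
{
	return old >= 0 && old <= INT_MAX - nr;
}

/*
 * A decrement by nr (assumed > 0) looks sane if it does not take the
 * count below zero.
 */
static bool ref_sub_ok(int old, int nr)
{
	return old >= nr;
}

/* Modify the counter, then judge the transition from the old value. */
static bool page_model_ref_add(struct page_model *p, int nr)
{
	int old = atomic_fetch_add(&p->refcount, nr);

	return ref_add_ok(old, nr);
}

static bool page_model_ref_sub(struct page_model *p, int nr)
{
	int old = atomic_fetch_sub(&p->refcount, nr);

	return ref_sub_ok(old, nr);
}
```

The point of the sketch is that the check uses the old value returned by the atomic operation itself, so a bad transition is noticed at the moment it happens rather than long after the damage is done. Whether such a check lives in the modifying function or is passed as old and new values to a tracepoint helper is exactly the trade-off debated above.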