On Wed, May 13, 2020 at 09:32:58AM +0800, Yafang Shao wrote:
> On Wed, May 13, 2020 at 5:29 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Tue, Feb 11, 2020 at 12:55:07PM -0500, Johannes Weiner wrote:
> > > The VFS inode shrinker is currently allowed to reclaim inodes with
> > > populated page cache. As a result it can drop gigabytes of hot and
> > > active page cache on the floor without consulting the VM (recorded as
> > > "inodesteal" events in /proc/vmstat).
> >
> > I'm sending a rebased version of this patch.
> >
> > We've been running with this change in the Facebook fleet since
> > February with no ill side effects observed.
> >
> > However, I just spent several hours chasing a mysterious reclaim
> > problem that turned out to be this bug again on an unpatched system.
> >
> > In the scenario I was debugging, the problem wasn't that we were
> > losing cache, but that we were losing the non-resident information for
> > previously evicted cache.
> >
> > I understood the file set enough to know it was thrashing like crazy,
> > but it didn't register as refaults to the kernel. Without detecting
> > the refaults, reclaim wouldn't start swapping to relieve the
> > struggling cache (plenty of cold anon memory around). It also meant
> > the IO delays of those refaults didn't contribute to memory pressure
> > in psi, which made userspace blind to the situation as well.
> >
> > The first aspect means we can get stuck in pathological thrashing, the
> > second means userspace OOM detection breaks and we can leave servers
> > (or Android devices, for that matter) hopelessly livelocked.
> >
> > New patch attached below. I hope we can get this fixed in 5.8, it's
> > really quite a big hole in our cache management strategy.
> >
> > ---
> > From 8db0b846ca0b7a136c0d3d8a1bee3d576990ba11 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Tue, 11 Feb 2020 12:55:07 -0500
> > Subject: [PATCH] vfs: keep inodes with page cache off the inode shrinker LRU
> >
> > The VFS inode shrinker is currently allowed to reclaim cold inodes
> > with populated page cache. This behavior goes back to CONFIG_HIGHMEM
> > setups, which required the ability to drop page cache in large highmem
> > zones to free up struct inodes in comparatively tiny lowmem zones.
> >
> > However, it has significant side effects that are hard to justify on
> > systems without highmem:
> >
> > - It can drop gigabytes of hot and active page cache on the floor
> >   without consulting the VM (recorded as "inodesteal" events in
> >   /proc/vmstat). Such an "aging inversion" between unreferenced inodes
> >   holding hot cache easily happens in practice: for example, a git tree
> >   whose objects are accessed frequently but no open file descriptors are
> >   maintained throughout.
> >
>
> Hi Johannes,
>
> I think it is reasonable to keep inodes with _active_ page cache off
> the inode shrinker LRU, but I'm not sure whether it is proper to keep
> the inodes with _only_ inactive page cache off the inode list lru
> either. Per my understanding, if the inode has only inactive page
> cache, then invalidate all these inactive page cache could save the
> reclaimer's time, IOW, it may improve the performance in this case.

The shrinker doesn't know whether pages are active or inactive.

There is a PageActive() flag, but that's a sampled state that's only
up to date when page reclaim is running. All the active pages could be
stale and getting deactivated on the next scan; all the inactive pages
could have page table references that would get them activated on the
next reclaim run etc.

You'd have to duplicate aspects of page reclaim itself to be sure
you're axing the right pages.

It also wouldn't be a reliable optimization.
This only happens when there is a disconnect between the inode and the cache lifetime, which is true for some situations but not others.
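
For anyone who wants to check whether a machine is hitting the
"inodesteal" events mentioned in the changelog, here is a rough
userspace sketch in C. It assumes the relevant /proc/vmstat counters
are the ones named pginodesteal and kswapd_inodesteal; the helper names
(vmstat_lookup, read_file, report_inodesteal) are made up for this
example and are not part of the patch or of any kernel interface.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find one "name value" line in /proc/vmstat-style text.
 * Returns 0 and stores the value in *out, or -1 if the counter is absent.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *out)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require the full counter name followed by a space. */
		if (!strncmp(line, name, len) && line[len] == ' ') {
			char *end;
			unsigned long long val;

			val = strtoull(line + len + 1, &end, 10);
			if (end == line + len + 1)
				return -1;	/* no digits after the name */
			*out = val;
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read a small file into a malloc'd, NUL-terminated buffer (NULL on error). */
char *read_file(const char *path)
{
	FILE *f = fopen(path, "r");
	size_t cap = 4096, used = 0, n;
	char *buf, *tmp;

	if (!f)
		return NULL;
	buf = malloc(cap);
	while (buf && (n = fread(buf + used, 1, cap - used - 1, f)) > 0) {
		used += n;
		if (cap - used < 2) {
			/* Grow the buffer, keeping room for the trailing NUL. */
			cap *= 2;
			tmp = realloc(buf, cap);
			if (!tmp)
				free(buf);
			buf = tmp;
		}
	}
	fclose(f);
	if (buf)
		buf[used] = '\0';
	return buf;
}

/* Print the assumed inode-steal counters; returns how many were found. */
int report_inodesteal(const char *path)
{
	/* Assumption: these are the counters behind "inodesteal". */
	static const char *names[] = { "pginodesteal", "kswapd_inodesteal" };
	char *text = read_file(path);
	unsigned long long val;
	int found = 0;
	size_t i;

	if (!text)
		return -1;
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!vmstat_lookup(text, names[i], &val)) {
			printf("%s %llu\n", names[i], val);
			found++;
		}
	}
	free(text);
	return found;
}
```

Calling report_inodesteal("/proc/vmstat") from main() twice, some time
apart, and comparing the numbers should give a hint whether the inode
shrinker is dropping page cache behind the VM's back. It does not show
the lost non-resident (refault) information discussed above.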