On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > And in __list_lru_walk_one() just add:
> > >
> > > 	case LRU_ROTATE_NODEFER:
> > > 		isolated++;
> > > 		/* fallthrough */
> > > 	case LRU_ROTATE:
> > > 		list_move_tail(item, &l->list);
> > > 		break;
> > >
> > > And now inodes with active page cache rotated to the tail of the
> > > list and are considered to have had work done on them. Hence they
> > > don't add to the work accumulation that the shrinker infrastructure
> > > defers, and so will allow the page reclaim to do it's stuff with
> > > page reclaim before such inodes will get reclaimed.
> > >
> > > That's *much* simpler than your proposed patch and should get you
> > > pretty much the same result.
> >
> > It solves the deferred work buildup, but it's absurdly inefficient.
>
> So you keep saying. Show us the numbers. Show us that it's so
> inefficient that it's completely unworkable. _You_ need to justify
> why violating modularity and layering is the only viable solution to
> this problem. Given that there is an alternative simple, straight
> forward solution to the problem, it's on you to prove it is
> insufficient to solve your issues.
>
> I'm sceptical that the complexity is necessary given that in general
> workloads, the inode shrinker doesn't even register in kernel
> profiles and that the problem being avoided generally isn't even hit
> in most workloads. IOWs, I'll take a simple but inefficient solution
> for avoiding a corner case behaviour over a solution that is
> complex, fragile and full of layering violations any day of the
> weeks.

I spent time last week benchmarking both implementations with various
combinations of icache and page cache size proportions. You're right
that most workloads don't care.
But there are workloads that do, and for them the behavior can become
pathological during drop-behind reclaim.

Page cache reclaim has two modes:

1. Workingset transitions where we flush out the old as quickly as
   possible and

2. Streaming buffered IO that doesn't benefit from caching, and so
   gets confined to the smallest possible amount of memory without
   touching active pages.

During 1. we may rotate busy inodes a few times until their page cache
disappears. This isn't great, but it is at least temporary.

The issue is 2. We may do drop-behind reclaim for extended periods of
time, during which the cache workingset remains completely untouched
and the corresponding inodes never become eligible for freeing.
Rotating them over and over represents a continuous parasitic drag on
reclaim. Depending on the proportions between the icache and the
inactive cache list, this drag can make up a sizable portion or even
the majority of overall CPU consumed by reclaim. (And if you recall
the discussion around RWF_UNCACHED, drop-behind reclaim is already
bottlenecked on CPU with decent IO devices.)

My test is doing drop-behind reclaim while most memory is filled with
a cache workingset that is held by an increasing number of inodes. The
first number here is the inodes, the second is the active pages held
by each:

    1,000 * 3072 pages:   0.39%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
   10,000 *  307 pages:   0.39%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  100,000 *   32 pages:   1.29%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  500,000 *    6 pages:  11.36%   0.08%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,000,000 *    3 pages:  26.40%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,500,000 *    2 pages:  42.97%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
3,000,000 *    1 page:   45.22%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab

As we get into higher inode counts, the shrinkers end up burning most
of the reclaim cycles to rotate workingset inodes.
For perspective, with 3 million inodes, when the shrinkers eat 45% of
the cycles to busy-poll the workingset inodes, page reclaim only
consumes about 10% to actually make forward progress.

IMO it goes from suboptimal to being a problem somewhere between 100k
and 500k in this table. That's not *that* many inodes - I'm counting
~74k files in my linux git tree alone.

North of 500k, it becomes pathological. That's probably less common,
but it happens in the real world. I checked the file servers that host
our internal source code trees. They have 16 times the memory of my
test box, but they routinely deal with 50 million+ inodes.

I think the additional complexity of updating the inode LRU according
to cache population state is justified in order to avoid these
pathological corner cases.
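
For illustration only, the rotate-only walk discussed above can be
modeled in userspace with a toy sketch. Every toy_* name here is
hypothetical and not a kernel API; the real list_lru walk, its locking
and the shrinker's deferral accounting are considerably more involved.
The sketch only shows that when every inode on the list still holds
page cache, each scanned entry is rotated and nothing is freed:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration; these are NOT the kernel's types. */
enum toy_lru_status { TOY_LRU_REMOVED, TOY_LRU_ROTATE };

struct toy_inode {
	struct toy_inode *prev, *next;
	unsigned long nr_cache_pages;	/* page cache still attached */
};

struct toy_walk_stats {
	unsigned long scanned, rotated, freed;
};

/* Circular doubly linked list with a sentinel head. */
static void toy_list_init(struct toy_inode *head)
{
	head->prev = head->next = head;
	head->nr_cache_pages = 0;
}

static void toy_list_del(struct toy_inode *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
}

static void toy_list_add_tail(struct toy_inode *item, struct toy_inode *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Assumed policy: an inode with page cache is busy and gets rotated. */
static enum toy_lru_status toy_isolate(const struct toy_inode *inode)
{
	return inode->nr_cache_pages ? TOY_LRU_ROTATE : TOY_LRU_REMOVED;
}

/* Scan up to nr_to_scan entries from the head of the list. */
static struct toy_walk_stats toy_walk(struct toy_inode *head,
				      unsigned long nr_to_scan)
{
	struct toy_walk_stats st = { 0, 0, 0 };

	assert(head != NULL);
	while (nr_to_scan-- && head->next != head) {
		struct toy_inode *item = head->next;

		st.scanned++;
		toy_list_del(item);
		if (toy_isolate(item) == TOY_LRU_ROTATE) {
			/* Busy: move to the tail, no memory is freed. */
			toy_list_add_tail(item, head);
			st.rotated++;
		} else {
			/* Idle: stays unlinked; storage is caller-owned. */
			st.freed++;
		}
	}
	return st;
}
```

In this model, walking 5,000 entries over a list of 1,000 inodes that
each still hold pages reports 5,000 rotations and zero freed. Once the
pages are gone, the same walk stops after freeing all 1,000.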