On Wed, Nov 28, 2018 at 04:36:25PM -0800, Ivan Babrou wrote:
> Hello,
>
> We're experiencing some interesting issues with memory reclaim, both
> kswapd and direct reclaim.
>
> A typical machine is 2 x NUMA with 128GB of RAM and 6 XFS filesystems.
> Page cache is around 95GB and dirty pages hover around 50MB, rarely
> jumping up to 1GB.

What is your workload?

> The catalyst of our issue is terrible disks. It's not uncommon to see
> the following stack in the hung task detector:
>
> Nov 15 21:55:13 21m21 kernel: INFO: task some-task:156314 blocked for more than 10 seconds.
> Nov 15 21:55:13 21m21 kernel:       Tainted: G           O    4.14.59-cloudflare-2018.7.5 #1
> Nov 15 21:55:13 21m21 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 15 21:55:13 21m21 kernel: some-task       D11792 156314 156183 0x00000080
> Nov 15 21:55:13 21m21 kernel: Call Trace:
> Nov 15 21:55:13 21m21 kernel:  ? __schedule+0x21a/0x820
> Nov 15 21:55:13 21m21 kernel:  schedule+0x28/0x80
> Nov 15 21:55:13 21m21 kernel:  schedule_preempt_disabled+0xa/0x10
> Nov 15 21:55:13 21m21 kernel:  __mutex_lock.isra.2+0x16a/0x490
> Nov 15 21:55:13 21m21 kernel:  ? xfs_reclaim_inodes_ag+0x265/0x2d0
> Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_ag+0x265/0x2d0
> Nov 15 21:55:13 21m21 kernel:  ? kmem_cache_alloc+0x14d/0x1b0
> Nov 15 21:55:13 21m21 kernel:  ? radix_tree_gang_lookup_tag+0xc4/0x130
> Nov 15 21:55:13 21m21 kernel:  ? __list_lru_walk_one.isra.5+0x33/0x130
> Nov 15 21:55:13 21m21 kernel:  xfs_reclaim_inodes_nr+0x31/0x40
> Nov 15 21:55:13 21m21 kernel:  super_cache_scan+0x156/0x1a0
> Nov 15 21:55:13 21m21 kernel:  shrink_slab.part.51+0x1d2/0x3a0
> Nov 15 21:55:13 21m21 kernel:  shrink_node+0x113/0x2e0
> Nov 15 21:55:13 21m21 kernel:  do_try_to_free_pages+0xb3/0x310
> Nov 15 21:55:13 21m21 kernel:  try_to_free_pages+0xd2/0x190
> Nov 15 21:55:13 21m21 kernel:  __alloc_pages_slowpath+0x3a3/0xdc0
> Nov 15 21:55:13 21m21 kernel:  ? ip_output+0x5c/0xc0
> Nov 15 21:55:13 21m21 kernel:  ? update_curr+0x141/0x1a0
> Nov 15 21:55:13 21m21 kernel:  __alloc_pages_nodemask+0x223/0x240
> Nov 15 21:55:13 21m21 kernel:  skb_page_frag_refill+0x93/0xb0
> Nov 15 21:55:13 21m21 kernel:  sk_page_frag_refill+0x19/0x80
> Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg_locked+0x247/0xdc0
> Nov 15 21:55:13 21m21 kernel:  tcp_sendmsg+0x27/0x40
> Nov 15 21:55:13 21m21 kernel:  sock_sendmsg+0x36/0x40
> Nov 15 21:55:13 21m21 kernel:  sock_write_iter+0x84/0xd0
> Nov 15 21:55:13 21m21 kernel:  __vfs_write+0xdd/0x140
> Nov 15 21:55:13 21m21 kernel:  vfs_write+0xad/0x1a0
> Nov 15 21:55:13 21m21 kernel:  SyS_write+0x42/0x90
> Nov 15 21:55:13 21m21 kernel:  do_syscall_64+0x60/0x110
> Nov 15 21:55:13 21m21 kernel:  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Here "some-task" is trying to send some bytes over the network and it's
> stuck in direct reclaim. Naturally, kswapd is not keeping up with its
> duties.

That's not kswapd causing the problem here, that's direct reclaim.

> It seems to me that our terrible disks sometimes take a pause to think
> about the meaning of life for a few seconds.

You've probably got lots of dirty inodes backed up waiting for IO. How
big are the xfs inode slab caches (slabinfo output, please)? How big
are the logs on the filesystems on the machine (xfs_info output,
please)?

> During that time the XFS
> shrinker is stuck, which drives the whole system out of free memory
> and in turn triggers direct reclaim.

It happens the other way around. What you have here is direct reclaim
trying to get the reclaim lock for a given AG to be able to reclaim
inodes from it. Direct reclaim only gets stuck here when all AGs are
already being reclaimed from (i.e. already under direct reclaim).

> One solution to this is to not go into direct reclaim by keeping more
> free pages with vm.watermark_scale_factor, but I'd like to discard
> this and argue that we're going to hit direct reclaim at some point
> anyway.
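[Editor's note: the slabinfo and xfs_info output requested above could
be collected with a sketch like the one below. It assumes the standard
procfs and xfsprogs locations; reading /proc/slabinfo generally
requires root.]

```shell
#!/bin/sh
# Sketch: gather the diagnostics requested above (inode slab cache
# sizes and per-filesystem log sizes). Standard tools assumed.

# XFS inode slab usage: the xfs_inode and xfs_ili rows show how many
# inode objects (and inode log items) are currently cached in memory.
if [ -r /proc/slabinfo ]; then
    grep -E '^xfs_(inode|ili)' /proc/slabinfo || true
fi

# Journal size of every mounted XFS filesystem: the "log" line of
# xfs_info reports the log length in filesystem blocks.
if command -v xfs_info >/dev/null 2>&1; then
    for mnt in $(awk '$3 == "xfs" { print $2 }' /proc/mounts); do
        xfs_info "$mnt" | grep '^log' || true
    done
fi
```

Multiplying the active-object count of the xfs_inode row by its object
size gives a rough figure for how much memory the inode cache holds.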
Right, but the problem is that the mm/ subsystem allows effectively
unbound direct reclaim concurrency. At some point, having tens to
hundreds of direct reclaimers all trying to write dirty inodes to disk
causes catastrophic IO breakdown and everything grinds to a halt
forever. We have to prevent that breakdown from occurring, i.e. we
have to throttle direct reclaim before it reaches IO breakdown
/somewhere/. The memory reclaim subsystem does not do it, so we have
to do it in XFS itself.

The problem here is that if we ignore direct reclaim (i.e. do nothing
rather than block waiting on reclaim progress) then the mm/ reclaim
algorithms will eventually think they aren't making progress and
unleash the OOM killer.

> The solution I have in mind for this is not to try to write anything
> to (disastrously terrible) storage in shrinkers. We have 95GB of page
> cache readily available for reclaim and it seems a lot cheaper to
> grab that.

The mm/ subsystem tries to keep a balance between the page cache and
shrinker-controlled caches. /proc/sys/vm/vfs_cache_pressure controls
the balance between page cache and inode/dentry cache reclaim
pressure.

> That brings me to the first question around the memory subsystem: are
> shrinkers supposed to flush any dirty data?

No shrinker does that, but if they wanted to they would be perfectly
within their rights to write back data. XFS is writing back /metadata/
here because direct reclaim is putting excessive pressure on the inode
cache and not allowing the background cleaners to do their job.

> My gut feeling is that
> they should not do that, because there's already a writeback
> mechanism with its own tunables and limits to take care of that. If a
> system runs out of memory reclaimable without IO and dirty pages are
> under the limit, it's totally fair to OOM somebody.
> It's totally possible that I'm wrong about this feeling, but either
> way I think the docs need an update on this matter:
>
> * https://elixir.bootlin.com/linux/v4.14.55/source/Documentation/filesystems/vfs.txt
>
>   nr_cached_objects: called by the sb cache shrinking function for
>   the filesystem to return the number of freeable cached objects it
>   contains.

You are assuming that "freeable" means "instantly freeable object",
not "unreferenced object that can be freed in the near future". We use
the latter definition in the shrinkers, not the former.

> My second question is conditional on the first one: if filesystems
> are supposed to flush dirty data in response to shrinkers, then how
> can I stop this, given my knowledge about the combination of lots of
> available page cache and terrible disks?

Filesystems have more caches than just the page cache.

> I've tried two things to address this problem ad-hoc.
>
> 1. Run the following systemtap script to trick shrinkers into
> thinking that XFS has nothing to free:
>
> probe kernel.function("xfs_fs_nr_cached_objects").return {
>     $return = 0
> }
>
> That did the job and shrink_node latency dropped considerably, while
> calls to xfs_fs_free_cached_objects disappeared.

Which effectively turned off direct reclaim for XFS inodes. See above
- this just means that when you have no easily reclaimable page cache
the system will OOM kill rather than wait for inodes to be reclaimed.
i.e. it looks good until everything suddenly goes wrong and then
everything dies a horrible death.

> 2. Use vm.vfs_cache_pressure to do the same thing. This failed
> miserably, because of the following code in super_cache_count:
>
>     if (sb->s_op && sb->s_op->nr_cached_objects)
>         total_objects = sb->s_op->nr_cached_objects(sb, sc);
>
>     total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
>     total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
>
>     total_objects = vfs_pressure_ratio(total_objects);
>     return total_objects;
>
> XFS was doing its job cleaning up inodes with the background
> mechanisms it has (m_reclaim_workqueue), but the kernel also stopped
> cleaning up readily available inodes after XFS.

You mean the kernel stopped handing readily available inodes to XFS to
free. i.e. the VFS has to release unreferenced inodes before they are
handed to XFS to clean up and free.

IOWs, I suspect you biased vm.vfs_cache_pressure to turn off inode
reclaim to effectively do the same thing as above. However, by
stopping the feed of unreferenced inodes to XFS you stopped inode
reclaim altogether and that caused other problems.

What I suspect you want to do is bias vfs_cache_pressure the other
way, so that it /aggressively/ feeds inode reclaim and so prevents
large numbers of inodes from building up in the cache. That way
reclaim won't have as many dirty inodes to block on when a sustained
memory shortage drives large amounts of direct reclaim into the
shrinkers...

> I'm not a kernel hacker and to be honest with you I don't even
> understand all the nuances here. All I know is:
>
> 1. I have lots of page cache and terrible disks.
> 2. I want to reclaim page cache and never touch disks in response to
> memory reclaim.
> 3. Direct reclaim will happen at some point, somebody will want a big
> chunk of memory all at once.
> 4. I'm probably ok with reclaiming clean xfs inodes synchronously in
> the reclaim path.
>
> This brings me to my final question: what should I do to avoid
> latency in reclaim (direct or kswapd)?
> To reiterate the importance of this issue: we see interactive
> applications that do zero IO stall for multiple seconds in writes to
> non-blocking sockets and in page faults on newly allocated memory,
> while 95GB of memory is in the page cache.

It really just sounds like you are allowing too many dirty inodes to
build up in memory. You have a huge amount of memory but really slow
disks - that's just asking for data and metadata writeback latency
problems.

It sounds like you need to cause inode cleaning to happen
earlier/faster than it is happening now, because your disks are slow
and so you can't allow a big backlog to build up. Reducing
/proc/sys/fs/xfs/xfssyncd_centisecs can get background metadata
writeback started earlier. As per Documentation/filesystems/xfs.txt:

  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
	The interval at which the filesystem flushes metadata
	out to disk and runs internal cache cleanup routines.

If this doesn't help, then I'd suggest that your config is simply
allowing too many dirty inodes to build up in memory. With XFS, the
number of dirty inodes that can be held in memory at any point in time
is bound by the size of the XFS log in the filesystem. Hence if you
have hundreds of MB of journal space, then you can cache hundreds of
thousands of dirty inodes in memory and so it can take tens of minutes
to write them back to slow disks.

So if earlier writeback doesn't help, I'd suggest that using smaller
logs on your filesystems is the only real option here....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
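[Editor's note: the tuning suggested above might look like the sketch
below. The values and the device name are illustrative placeholders,
not recommendations; fs.xfs.xfssyncd_centisecs is the documented
sysctl, and the log size can only be chosen at mkfs time.]

```shell
# Start background metadata writeback earlier: drop the flush interval
# from the 3000-centisecond default to 10 seconds (illustrative value).
sysctl fs.xfs.xfssyncd_centisecs=1000

# A smaller log bounds how many dirty inodes can be dirty in memory at
# once. Here a 32MB internal log is requested instead of the default
# (illustrative; mkfs destroys existing data, /dev/sdX1 is a placeholder).
mkfs.xfs -l size=32m /dev/sdX1
```

Making the sysctl change persistent would go through /etc/sysctl.d/ in
the usual way; the log size change requires recreating the filesystem.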