On Thu, Oct 15, 2020 at 10:56:11AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 15, 2020 at 06:21:51PM +1100, Dave Chinner wrote:
> > @@ -74,6 +196,8 @@ xfs_buftarg_alloc(
> >  	btp->bt_mount = mp;
> >  	btp->bt_fd = libxfs_device_to_fd(bdev);
> >  	btp->bt_bdev = bdev;
> > +	btp->bt_psi_fd = -1;
> > +	btp->bt_exiting = false;
> >
> >  	if (xfs_buftarg_setsize_early(btp))
> >  		goto error_free;
> > @@ -84,8 +208,13 @@ xfs_buftarg_alloc(
> >  	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
> >  		goto error_lru;
> >
> > +	if (xfs_buftarg_mempressue_init(btp))
>
> So what happens if PSI isn't enabled or procfs isn't mounted yet?
> xfs_repair just ... fails? That seems disappointing, particularly if
> the admin is trying to fix a dead root fs from the initramfs premount
> shell and /proc isn't set up yet.

Yes, right now it just fails. I'm more interested right now in
getting the new infrastructure working such that the kernel buffer
cache "just works" when there's more metadata than RAM to cache it
in.

> Hmm, looks like Debian actually /does/ set up procfs nowadays. Still,
> if we're going to add a hard requirement on CONFIG_PSI=y and
> CONFIG_PSI_DEFAULT_DISABLED=n, we need to advertise this kind of loudly.
>
> (Personally, I thought that if there's no pressure stall information,
> we'd just fall back to not having a shrinker and daring the system to
> OOM us like it does now...)

Well, the existing buffer cache does have a shrinker mechanism - it
will shake the cache down when it is full to free up old buffers.
That's what all the MRU lists and buffer priority stuff in the
repair prefetch code is all about. repair tries to bound the maximum
size of the buffer cache and prevent OOM that way.

If it calculates that the memory requirement is larger than RAM,
that's when it gets into OOM trouble because we still allow it to
use lots of memory and then just hope...

I kind of want to get away from all those messy static heuristics.
I'd much prefer that we do dynamic cache growth detection and size
calculations in repair and determine if we should purge the cache at
the end of each AG or retain it in RAM. i.e. if ((per ag cache size
* no. of AGs) > 75% RAM) then purge the AG cache when the phase scan
is done.

This way we run with minimal caching (just what is needed for
prefetching to be efficient) when it is likely we can't fit all the
metadata in RAM, and otherwise we behave like we currently do. That
sort of setup will go a long way to avoiding OOM kill and the need
for actual memory shrinkers to activate.

This mode could be activated if the PSI information is not there,
hence might also solve most of the rescue situation problems.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
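
For illustration only, the "purge the AG cache if the projected size
exceeds 75% of RAM" check described above could be sketched roughly
as follows. This is not code from the patch set: the helper names
phys_ram_bytes() and should_purge_ag_cache() are hypothetical, the
per-AG cache size is assumed to be measured by the caller, and
treating an unknown RAM size as "retain the cache" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: total physical RAM in bytes, or 0 if it
 * cannot be determined. _SC_PHYS_PAGES is a glibc/Linux extension.
 */
static uint64_t
phys_ram_bytes(void)
{
	long	pages = sysconf(_SC_PHYS_PAGES);
	long	pagesz = sysconf(_SC_PAGESIZE);

	if (pages <= 0 || pagesz <= 0)
		return 0;
	return (uint64_t)pages * (uint64_t)pagesz;
}

/*
 * Hypothetical check: true when (per-AG cache size * number of AGs)
 * is larger than 75% of RAM, i.e. the AG cache should be purged
 * once the phase scan of that AG is done.
 */
static bool
should_purge_ag_cache(
	uint64_t	per_ag_cache_bytes,
	uint64_t	ag_count,
	uint64_t	ram_bytes)
{
	uint64_t	threshold;

	/* Assumption: with nothing to estimate, behave as we do today. */
	if (ag_count == 0 || ram_bytes == 0)
		return false;

	threshold = ram_bytes - ram_bytes / 4;	/* 75% of RAM */

	/*
	 * Same comparison as per_ag_cache_bytes * ag_count > threshold,
	 * written as a division so the product cannot overflow.
	 */
	return per_ag_cache_bytes > threshold / ag_count;
}
```

A caller would sample the cache footprint after scanning the first
AG and pass it in together with the AG count and phys_ram_bytes().
For example, 1GiB per AG across 32 AGs on a 16GiB machine projects
to 32GiB against a 12GiB threshold, so the per-AG cache would be
purged.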