On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote: > On Sat, Apr 30 2016, Dave Chinner wrote: > > Indeed, blocking the superblock shrinker in reclaim is a key part of > > balancing inode cache pressure in XFS. If the shrinker starts > > hitting dirty inodes, it blocks on cleaning them, thereby slowing > > the rate of allocation to that which inodes can be cleaned and > > reclaimed. There are also background threads that walk ahead freeing > > clean inodes, but we have to throttle direct reclaim in this manner > > otherwise the allocation pressure vastly outweighs the ability to > > reclaim inodes. if we don't balance this, inode allocation triggers > > the OOM killer because reclaim keeps reporting "no progress being > > made" because dirty inodes are skipped. BY blocking on such inodes, > > the shrinker makes progress (slowly) and reclaim sees that memory is > > being freed and so continues without invoking the OOM killer... > > I'm very aware of the need to throttle allocation based on IO. I > remember when NFS didn't quite get this right and filled up memory :-) > > balance_dirty_pages() used to force threads to wait on the write-out of > one page for every page that they dirtied (or wait on 128 pages for every 128 > dirtied or whatever). This was exactly to provide the sort of > throttling you are talking about. > > We don't do that any more. It was problematic. I don't recall all the > reasons but I think that different backing devices having different > clearance rates was part of the problem. As the original architect of those changes (the IO-less dirty page throttling was an evolution of the algorithm I developed for Irix years before it was done in linux) I remember the reasons very well. Mostly it was to prevent what we termed "IO breakdown" where so many different foreground threads were dispatching writeback that it turned dirty page writeback into random IO instead of being nice, large sequential IOs. > So now we monitor clearance rates and wait for some number of blocks to > be written, rather than waiting for some specific blocks to be written. Right - we have a limited pool of flusher threads dispatching IO efficiently as possible, and foreground dirtying processes wait for that to do the work of cleaning pages. > We should be able to do the same thing to balance dirty inodes as > we do to balance dirty pages. No, we can't. Dirty inode state is deeply tied into filesystem implementations - it's intertwined with the journal operation in many filesystems and can't be separated easily. Indeed, some filesystem don't even use the VFS for tracking dirty inode state, nor do they implement the ->write_inode method. Hence the VFS northe inode cache shrinkers are able to determine if an inode is dirty, or if it is, trigger writeback of it. IOWs, inode caches are not unified in allocation, behaviour, design or implementation like the page cache is, and so balancing dirty inodes is likely not to be possible.... > >> If it could be changed > >> to just schedule the IO without waiting for it then I think this > >> would be safe to be called in any FS allocation context. It already > >> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking > >> if the lock is held. > > > > We could, but then we have the same problem as the inode cache - > > there's no indication of progress going back to the memory reclaim > > subsystem, nor is reclaim able to throttle memory allocation back to > > the rate at which reclaim is making progress. > > > > There's feedback loops all throughout the XFS reclaim code - it's > > designed specifically that way - I made changes to the shrinker > > infrastructure years ago to enable this. It's no different to the > > dirty page throttling that was done at roughly the same time - > > that's also one big feedback loop controlled by the rate at which > > pages can be cleaned. Indeed, it was designed was based on the same > > premise as all the XFS shrinker code: in steady state conditions > > we can't allocate a resource faster than we can reclaim it, so we > > need to make reclaim as efficient at possible... > > You seem to be referring here to the same change that I was referred to > above, but seem to be seeing it from a different perspective. Sure, not many people have the same viewpoint as the preson who had to convince everyone else that it was a sound idea even before prototypes were written... > Waiting for inodes to be freed in important. Waiting for any one > specific inode to be freed is dangerous. Sure. But there's a difference between waiting on an inode you have no idea when you'll be able to reclaim it, versus waiting on IO completion for an inode you've already guaranteed can complete reclaim and are the only context that has access to the inode.... Not to mention that XFS also triggers an async inode reclaim worker thread to do non-blocking reclaim of inodes in the background, so while direct reclaim throttles (and hence prevents excessive concurrent direct reclaim from trashing the working set of clean inodes whilst ignoring dirty inodes) we still get appropriate amounts of inode cache reclaim occurring... > > The problem with this is that a single kswapd thread can't keep > > up with all of the allocation pressure that occurs. e.g. a > > 20-core intel CPU with local memory will be seen as a single > > node and so will have a single kswapd thread to do reclaim. > > There's a massive imbalance between maximum reclaim rate and > > maximum allocation rate in situations like this. If we want > > memory reclaim to run faster, we to be able to do more work > > *now*, not defer it to a context with limited execution > > resources. > > I agree it makes sense to do *work* locally on the core that sees > the need. What I want to avoid is *sleeping* locally. With direct reclaim, sleeping locally is the only way we can throttle it. If we want to avoid reclaim from doing this, then we need to move to a reclaim model similar to the IO-less dirty page throttling mechanism which directly limits reclaim concurrency and so individual subsystem shrinkers don't need to do this to prevent reclaim from trashing the working set. Indeed, I've proposed this in the past, and been told by various developers that it can't possibly work and, besides, fs developers know nothing about mm/ design.... I've got better things to do than bash my head against brick walls like this, even though I know such algorithms both perform and scale for memory reclaim because that's exactly where I got the original inspriation for IO-less dirty page throttling from: the Irix memory reclaim algorithms.... > How would > it be to make evict_inode non-blocking? We can't without completely redesigning how iput_final() and evict() work and order operations. Indeed, by the time we call ->evict_inode(), we've already marked the inode as I_FREEING, removed the inode from the superblock inode list, dirty writeback lists, the LRU lists and blocked waiting for any inode writeback in progress to complete. We'd also have to redesign the shrinker code, because it does bulk operations to isolate inodes from the LRU onto temporary lists after marking them I_FREEING, which then does evict() calls from a separate context. We'd then have to change all that code to handle inodes that can't be reclaimed and re-insert them into the LRUs approriately, and that would have to be managed in a way that we don't end up live-locking the shrinker lru scanning on unreferenced inodes that cannot be evicted. I'm also not sure whether dispose_list() function can add inodes back onto the LRU safely because contexts calling evict() make the assumption the inode is no longer on the LRU and don't need to access LRU locks at all... Also, evict() is called when unmounting the filesystem, and so in those cases we actually want it to block and do everything that is necessary to free the inode, so this would introduce more complex interactions depending on caller context.... > It would do as much work > as it can, which in many cases would presumably complete the whole > task. But if it cannot progress for some reason, it returns > -EAGAIN and then the rest gets queued for later. Might that work, > or is there often still lots of CPU work to be done after things > have been waited for? Freeing the inode occurs after it's been waited for, usually. i.e. no wait, no inode being freed and no feedback for throttling of the inode reclaim rate... FWIW, I don't think making evict() non-blocking is going to be worth the effort here. Making memory reclaim wait on a priority ordered queue while asynchronous reclaim threads run reclaim as efficiently as possible and wakes waiters as it frees the memory the waiters require is a model that has been proven to work in the past, and this appears to me to be the model you are advocating for. I agree that direct reclaim needs to die and be replaced with something entirely more predictable, controllable and less prone to deadlock contexts - you just need to convince the mm developers that it will perform and scale better than what we have now. In the mean time, having a slightly more fine grained GFP_NOFS equivalent context will allow us to avoid the worst of the current GFP_NOFS problems with very little extra code. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html