On Sun, Mar 29, 2020 at 05:16:02PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 30, 2020 at 09:08:02AM +1100, Dave Chinner wrote:
> > On Sun, Mar 29, 2020 at 10:22:09AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > 
> > > A customer reported rcu stalls and softlockup warnings on a computer
> > > with many CPU cores and many many more IO threads trying to write to
> > > a filesystem that is totally out of space.  Subsequent analysis
> > > pointed to the many many IO threads calling xfs_flush_inodes ->
> > > sync_inodes_sb, which causes a lot of wb_writeback_work to be
> > > queued.  The writeback worker spends so much time trying to wake the
> > > many many threads waiting for writeback completion that it trips the
> > > softlockup detector, and (in this case) the system automatically
> > > reboots.
> > > 
> > > In addition, they complain that the lengthy xfs_flush_inodes scan
> > > traps all of those threads in uninterruptible sleep, which hampers
> > > their ability to kill the program or do anything else to escape the
> > > situation.
> > > 
> > > If there are thousands of threads trying to write to files on a full
> > > filesystem, each of those threads will start a separate copy of the
> > > inode flush scan.  This is pointless since we only need one scan, so
> > > rate limit the inode flush.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/xfs_mount.h |    1 +
> > >  fs/xfs/xfs_super.c |   14 ++++++++++++++
> > >  2 files changed, 15 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index 88ab09ed29e7..50c43422fa17 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -167,6 +167,7 @@ typedef struct xfs_mount {
> > >  	struct xfs_kobj		m_error_meta_kobj;
> > >  	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
> > >  	struct xstats		m_stats;	/* per-fs stats */
> > > +	struct ratelimit_state	m_flush_inodes_ratelimit;
> > >  
> > >  	struct workqueue_struct	*m_buf_workqueue;
> > >  	struct workqueue_struct	*m_unwritten_workqueue;
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 68fea439d974..abf06bf9c3f3 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -528,6 +528,9 @@ xfs_flush_inodes(
> > >  {
> > >  	struct super_block	*sb = mp->m_super;
> > >  
> > > +	if (!__ratelimit(&mp->m_flush_inodes_ratelimit))
> > > +		return;
> > > +
> > >  	if (down_read_trylock(&sb->s_umount)) {
> > >  		sync_inodes_sb(sb);
> > >  		up_read(&sb->s_umount);
> > > @@ -1366,6 +1369,17 @@ xfs_fc_fill_super(
> > >  	if (error)
> > >  		goto out_free_names;
> > >  
> > > +	/*
> > > +	 * Cap the number of invocations of xfs_flush_inodes to 16 for
> > > +	 * every quarter of a second.  The magic numbers here were
> > > +	 * determined by observation neither to cause stalls in
> > > +	 * writeback when there are a lot of IO threads and the fs is
> > > +	 * near ENOSPC, nor to cause any fstests regressions.  YMMV.
> > > +	 */
> > > +	ratelimit_state_init(&mp->m_flush_inodes_ratelimit, HZ / 4, 16);
> > > +	ratelimit_set_flags(&mp->m_flush_inodes_ratelimit,
> > > +			RATELIMIT_MSG_ON_RELEASE);
> > 
> > Urk.
> > 
> > RATELIMIT_MSG_ON_RELEASE prevents "callbacks suppressed" messages
> > when rate limiting is active and the state resets via __ratelimit().
> > However, in ratelimit_state_exit(), that same flag -enables- printing
> > of a "suppressed" summary when rate limiting was active and the state
> > is torn down.
> > 
> > Same flag, exact opposite behaviour...
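For reference, the two code paths I'm contrasting above look roughly
like this (trimmed and paraphrased from lib/ratelimit.c and
include/linux/ratelimit.h; not verbatim source):

	/* ___ratelimit(): on interval rollover, the flag *suppresses*
	 * the "callbacks suppressed" message. */
	if (time_is_before_jiffies(rs->begin + rs->interval)) {
		if (rs->missed) {
			if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE)) {
				printk(KERN_WARNING
				       "%s: %d callbacks suppressed\n",
				       func, rs->missed);
				rs->missed = 0;
			}
		}
		rs->begin   = jiffies;
		rs->printed = 0;
	}

	/* ratelimit_state_exit(): here the same flag *enables* the
	 * summary message, printed with current->comm as the prefix. */
	static inline void ratelimit_state_exit(struct ratelimit_state *rs)
	{
		if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE))
			return;

		if (rs->missed) {
			pr_warn("%s: %d output lines suppressed due to ratelimiting\n",
				current->comm, rs->missed);
			rs->missed = 0;
		}
	}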
> > 
> > The comment says its behaviour is supposed to match that of
> > ratelimit_state_exit() (i.e. print a message on ratelimit exit), so I
> > really can't tell if this is correct/intended usage or just API
> > abuse....
> 
> This flag (AFAICT) basically means "summarize skipped calls later",
> where later is when _exit is called.  It's very annoying that this
> printk thing is mixed in with what is otherwise a simple ratelimiting
> mechanism, since there isn't much to be gained by spamming dmesg every
> time a buffered write hits ENOSPC, and absolutely nothing to be gained
> by logging that at umount time (with comm being the umount process!)
> 
> Since there's no design documentation for how the ratelimiting system
> works, the best I can do is RTFS and do whatever magic gets the outcome
> I want (which is to set the flag and skip calling _exit).  Only one of
> the ratelimit state users calls ratelimit_state_exit(), so it's
> apparently not required.

Yeah, that's pretty much where I got to with it - I wasn't at all sure
if there was something obvious I was missing or if it was just another
piece of poorly written code...

> This all is poor engineering practice, but you /did/ suggest
> ratelimiting (on IRC) and I don't want to go reimplementing ratelimit.c
> either, and it /does/ fix the xfs_flush_inodes flooding problems.

*nod*

Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>

-- 
Dave Chinner
david@xxxxxxxxxxxxx
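For completeness, the pattern settled on in this thread boils down to
the following minimal sketch.  Only the ratelimit_state_init(),
ratelimit_set_flags() and __ratelimit() calls are the real kernel API;
the demo_*() names are made up for illustration:

	#include <linux/ratelimit.h>

	static struct ratelimit_state demo_rs;

	static void demo_init(void)
	{
		/* Allow 16 calls per quarter second, as in the patch. */
		ratelimit_state_init(&demo_rs, HZ / 4, 16);

		/* With this flag set, __ratelimit() never logs the
		 * "callbacks suppressed" message when it throttles. */
		ratelimit_set_flags(&demo_rs, RATELIMIT_MSG_ON_RELEASE);
	}

	static void demo_throttled_work(void)
	{
		/* Over budget for this interval: drop the call. */
		if (!__ratelimit(&demo_rs))
			return;

		/* ... do the expensive work here ... */
	}

	/* And because ratelimit_state_exit() is deliberately never
	 * called, the deferred "output lines suppressed" summary never
	 * fires either, so nothing is ever logged at all. */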