On Thu, Mar 26, 2020 at 07:51:53PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 27, 2020 at 01:27:14PM +1100, Dave Chinner wrote:
> > On Thu, Mar 26, 2020 at 06:45:58PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > 
> > > A customer reported rcu stalls and softlockup warnings on a computer
> > > with many CPU cores and many many more IO threads trying to write to a
> > > filesystem that is totally out of space.  Subsequent analysis pointed to
> > > the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb,
> > > which causes a lot of wb_writeback_work to be queued.  The writeback
> > > worker spends so much time trying to wake the many many threads waiting
> > > for writeback completion that it trips the softlockup detector, and (in
> > > this case) the system automatically reboots.
> > 
> > That doesn't sound right. Each writeback work that is queued via
> > sync_inodes_sb should only have a single process waiting on its
> > completion. And how many threads do you actually need to wake up
> > for it to trigger a 10s soft-lockup timeout?
> > 
> > More detail, please?
> 
> It's a two socket 64-core system with some sort of rdma/infiniband magic
> and somewhere between 600-900 processes doing who knows what with the
> magic.  Each of those threads *also* is writing trace data to its own
> separate trace file (three private log files per process).  Hilariously
> they never check the return code from write() so they keep pounding the
> system forever.  <facepalm>

Ah, another game of ye olde "blame the filesystem because it's the
first to complain"...

> (I don't know what the rdma/infiniband magic is, they won't tell me.)
> 
> When the filesystem fills up all the way (it's a 10G fs with 8,207
> blocks free) they keep banging away on it until something finally dies.
> 
> I tried writing a dumb fstest to simulate the log writer part, but that
> never succeeds in triggering the rcu stalls.

Which means it probably requires a bunch of other RCU magic to be
done by other parts of the system to trigger it...

> If you want the gory dmesg details I can send you some dmesg log.

No need, won't be able to read it anyway because facepalm...

> > > In addition, they complain that the lengthy xfs_flush_inodes scan traps
> > > all of those threads in uninterruptible sleep, which hampers their
> > > ability to kill the program or do anything else to escape the situation.
> > > 
> > > Fix this by replacing the full filesystem flush (which is offloaded to a
> > > workqueue which we then have to wait for) with directly flushing the
> > > file that we're trying to write.
> > 
> > Which does nothing to flush -other- outstanding delalloc
> > reservations and allow the eofblocks/cowblock scan to reclaim unused
> > post-EOF speculative preallocations.
> > 
> > That's the purpose of the xfs_flush_inodes() - without it we can get
> > very premature ENOSPC, especially on small filesystems when writing
> > largish files in the background. So I'm not sure that dropping the
> > sync is a viable solution. It is actually needed.
> 
> Yeah, I did kinda wonder about that...
> 
> > Perhaps we need to go back to the ancient code that only allowed XFS
> > to run a single xfs_flush_inodes() at a time - everything else
> > waited on the single flush to complete, then all returned at the
> > same time...
> 
> That might work too.  Admittedly it's pretty silly to be running this
> scan over and over and over considering that there's never going to be
> any more free space.
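The serialisation could be as simple as funnelling everything through
a single per-mount work item - completely untested sketch below, and
note the two xfs_mount fields are invented for illustration (they are
not existing members and would need to be set up in the mount path):

/*
 * Completely untested sketch.  Assumes two new fields in struct
 * xfs_mount (invented for illustration):
 *
 *	struct workqueue_struct	*m_flush_workqueue;
 *	struct work_struct	m_flush_inodes_work;
 *
 * both initialised at mount time, with the work item pointing at
 * xfs_flush_inodes_worker().
 */
static void
xfs_flush_inodes_worker(
	struct work_struct	*work)
{
	struct xfs_mount	*mp = container_of(work, struct xfs_mount,
						   m_flush_inodes_work);

	sync_inodes_sb(mp->m_super);
}

void
xfs_flush_inodes(
	struct xfs_mount	*mp)
{
	/*
	 * If flush_work() returns true, we just waited for a flush
	 * that was already in progress, so there's no point kicking
	 * off another full scan - everyone piles onto the one flush
	 * and returns together.
	 */
	if (flush_work(&mp->m_flush_inodes_work))
		return;

	queue_work(mp->m_flush_workqueue, &mp->m_flush_inodes_work);
	flush_work(&mp->m_flush_inodes_work);
}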
Actually, what if we just rate limit the calls? Once a second would
probably do the trick just fine - after the first few seconds there'll
be no space left to reclaim, anyway...
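i.e. something like this completely untested sketch, using the generic
ratelimit infrastructure (the m_flush_ratelimit field is invented for
illustration - it's not an existing xfs_mount member - and we'd want to
shut up the "callbacks suppressed" noise __ratelimit() printks by
default):

/*
 * Completely untested sketch.  Assumes a new field in struct
 * xfs_mount (invented for illustration):
 *
 *	struct ratelimit_state	m_flush_ratelimit;
 *
 * initialised in the mount path with something like:
 *
 *	ratelimit_state_init(&mp->m_flush_ratelimit, HZ, 1);
 *
 * i.e. at most one full flush per second; everyone else returns
 * immediately and just retries their write.
 */
void
xfs_flush_inodes(
	struct xfs_mount	*mp)
{
	/* __ratelimit() returns nonzero when we're allowed to run. */
	if (!__ratelimit(&mp->m_flush_ratelimit))
		return;

	sync_inodes_sb(mp->m_super);
}

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx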