On Wed, May 11, 2022 at 12:16:57PM +1000, Chris Dunlop wrote:
> On Wed, May 11, 2022 at 11:36:54AM +1000, Dave Chinner wrote:
> > I think that's the thing that some people have missed in this
> > thread - I've known for a while now the scope of problems blocking
> > flushes from statfs() can cause - any issue with background
> > inodegc not making progress can deadlock the filesystem. I've lost
> > count of the number of times I had inodegc go wrong or crash and
> > the only possible recovery was to power cycle because nothing
> > could be executed from the command line.
> >
> > That's because statfs() appears to be used in some commonly used
> > library function and so many utility programs on the system will
> > get stuck and be unable to run when inodegc dies, deadlocks, or
> > takes a real long time to make progress.
> >
> > Hence I didn't need to do any analysis of Chris' system to know
> > exactly what was going on - I've seen it many, many times myself
> > and have work in progress that avoids those issues with inodegc
> > work that never completes.
> >
> > IOWs, everything is understood, fixes are already written, and
> > there's no actual threat of data loss or corruption from this
> > issue. It's just lots of stuff getting stuck behind a long running
> > operation...
>
> Your patches are stuck behind other long running priorities, or the
> patches address an issue of stuff getting stuck? Or, of course,
> both? ;-)

I was referring to the way your system got blocked up - lots of stuff
stuck behind a long running operation. :)

As for priorities, the queue/flushing fixes are part of a body of work
in progress that I benched for 3 months when I agreed to take over
maintenance of the kernel code for the current dev cycle. My
priorities over recent weeks have been to get everyone else's work
reviewed, tested, improved, and merged, not to progress my own
projects.

As such, I wasn't considering changing inodegc queueing in the next
merge window at all - it would just be part of the larger body of work
when it was ready. Your problems have made it apparent that we really
need those problems fixed ASAP (i.e. the next merge cycle), hence the
priority bump....

> Out of interest, would this work also reduce the time spent mounting
> in my case? I.e. would a lot of the work from my recovery mount be
> punted off to a background thread?

No. Log recovery will punt the remaining inodegc work to background
threads, so it might get slightly parallelised, but we have a hard
barrier between completing log recovery work and completing the mount
process at the moment. Hence we wait for inodegc to complete before
log recovery is marked as complete.

In theory we could allow background inodegc to bleed into active duty
once log recovery has processed all the unlinked lists, but that's a
change of behaviour that would require a careful audit of the last
part of the mount path to ensure it is safe to run concurrent
background operations whilst completing mount state updates.

This hasn't been on my radar at all up until now, but I'll have a
think about it next time I look at those bits of recovery. I suspect
that probably won't be far away - I have a set of unlinked inode list
optimisations that rework the log recovery infrastructure near the top
of my current work queue, so I will be in that general vicinity over
the next few weeks...

Regardless, the inodegc work is going to be slow on your system no
matter what we do because of the underlying storage layout.
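To make the shape of the problem concrete, here's a toy userspace
model of the pattern - it's nothing like the real XFS code, and all
the names in it are made up for illustration - but it shows why a slow
background worker ends up stalling every statfs()-style caller that
flushes and waits on it:

/*
 * Toy model, not kernel code: a background worker drains a queue of
 * deferred work, and a "flush" path (the statfs() analogue) waits for
 * the queue to empty. If the worker is slow or stuck, every flusher
 * is stuck behind it. Build with: cc -pthread toy.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int queued = 100;		/* pending inactivation work items */

static void *gc_worker(void *arg)
{
	for (;;) {
		pthread_mutex_lock(&lock);
		if (queued > 0) {
			queued--;		/* "inactivate" one inode */
			if (queued == 0)
				pthread_cond_broadcast(&drained);
		}
		pthread_mutex_unlock(&lock);
		sleep(1);		/* slow storage: each item takes a while */
	}
	return NULL;
}

/* statfs()-like path: flush and wait for background work to drain. */
static void flush_and_wait(void)
{
	pthread_mutex_lock(&lock);
	while (queued > 0)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t gc;

	pthread_create(&gc, NULL, gc_worker, NULL);
	printf("statfs caller waiting on background gc...\n");
	flush_and_wait();	/* blocks until the worker catches up */
	printf("done\n");
	return 0;
}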
What we need to do is try to remove all the places where stuff can get
blocked on inodegc completion, but that is somewhat complex because we
still need to be able to throttle queue depths in various situations.

As always, there is plenty of scope for improvement - the question is
working out which improvements are the ones we actually need first...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx