On Wed, Jul 10, 2024 at 09:12:58AM +1000, NeilBrown wrote:
> On Thu, 04 Jul 2024, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 09:29:00PM +1000, NeilBrown wrote:
> > > I know nothing of this stance. Do you have a reference?
> >
> > No particular one.
> >
> > > I have put a modest amount of work into ensuring NFS to a server on the same machine works and last I checked it did - though I'm more confident of NFSv3 than NFSv4 because of the state manager thread.
> >
> > How do you propagate the NOFS flag (and NOIO for a loop device) to the server and the workqueues run by the server and the file system called by it? How do you ensure WQ_MEM_RECLAIM gets propagated to all workqueues that could be called by the file system on the server (the problem kicking off this discussion)?
>
> Do we need to propagate these?
>
> NOFS is for deadlock avoidance. A filesystem "backend" (Dave's term - I think for the parts of the fs that handle write-back) might allocate memory, that might block waiting for memory reclaim, memory reclaim might re-enter the filesystem backend and might block on a lock (or similar) held while allocating memory. NOFS breaks that deadlock.
>
> The important thing here isn't the NOFS flag, it is breaking any possible deadlock.

NOFS doesn't "break" any deadlocks. It simply prevents recursion from one filesystem context to another. We don't have to use NOFS if recursion is safe and won't deadlock. That is, it may be safe for a filesystem to use GFP_KERNEL allocations in its writeback path.

If the filesystem doesn't implement ->writepage (like most of the major filesystems these days) there is no way for memory reclaim to recurse back into the fs writeback path. Hence GFP_NOFS is not needed in writeback context to prevent reclaim recursion back into the filesystem writeback path....

And the superblock shrinkers can't deadlock - they are non-blocking and only act on unreferenced inodes. Hence any code that has a locked inode is either evicting an unreferenced inode or holds a reference to the inode. If we are doing an allocation with either of those sorts of inodes locked, there is no way that memory reclaim recursion can trip over the locked inode and deadlock.

So between the removal of ->writepage, non-blocking shrinkers, and scoped NOIO context for loop devices, I'm not sure there are any generic reclaim recursion paths that can actually deadlock. i.e. GFP_NOFS really only needs to be used if the filesystem itself cannot safely recurse back into itself.
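For the cases where a filesystem does need that protection, the scoped memalloc_nofs_save()/memalloc_nofs_restore() API is the way to express it - the whole critical section becomes NOFS rather than tagging individual allocations. A minimal sketch, illustrative only (the function and its name are made up, the scoped API is real):

        #include <linux/sched/mm.h>
        #include <linux/slab.h>

        /*
         * Any allocation between save and restore - including GFP_KERNEL
         * allocations done by library code called from here - behaves as
         * GFP_NOFS, so reclaim cannot recurse back into the filesystem
         * while we hold fs locks.
         */
        static void *alloc_under_fs_lock(size_t size)
        {
                unsigned int nofs_flags;
                void *p;

                nofs_flags = memalloc_nofs_save();
                p = kmalloc(size, GFP_KERNEL);
                memalloc_nofs_restore(nofs_flags);
                return p;
        }

memalloc_noio_save()/memalloc_noio_restore() is the equivalent NOIO scope - setting PF_MEMALLOC_NOIO directly on the task, as the loop worker code quoted below does, has the same effect.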
> Layered filesystems introduce a new complexity.

Nothing new about layered filesystems - we've been doing this for decades...

> The backend for one filesystem can call into the front end of another filesystem. That front-end is not required to use NOFS and even if we impose PF_MEMALLOC_NOFS, the front-end might wait for some work-queue action which doesn't inherit the NOFS flag.
>
> But this doesn't necessarily matter. Calling into the filesystem is not the problem - blocking waiting for a reply is the problem. It is blocking that creates deadlocks. So if the backend of one filesystem queues to a separate thread the work for the front end of the other filesystem and doesn't wait for the work to complete, then a deadlock cannot be introduced.
>
> /dev/loop uses the loop%d workqueue for this. loop-back NFS hands the front-end work over to nfsd. The proposed localio implementation uses a nfslocaliod workqueue for exactly the same task. These remove the possibility of deadlock and mean that there is no need to pass NOFS through to the front-end of the backing filesystem.

I think this logic is fundamentally flawed. Pushing IO submission to a separate thread context which runs it in GFP_KERNEL context does not help if the deadlock occurs during IO submission.

With loop devices, there's a "global" lock in the lower filesystem backing the loop device - the image file inode lock. The IO issued by the loop device will -always- hit the same inode and the same inode locks. Hence if we do memory allocation with an inode lock held exclusive in the lower filesystem (e.g. a page cache folio for a buffered write), we cannot allow memory reclaim during any allocation with the image file inode locked to recurse into the upper filesystem. If the upper filesystem then performs an operation that requires IO to be submitted and completed to make progress, we have a deadlock condition due to recursion from the lower to the upper filesystem, regardless of the fact that the lower IO submission is run from a different task.

Hence the loop device sets up the backing file mapping as:

        lo->lo_backing_file = file;
        lo->old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

i.e. GFP_NOIO context. It also sets up the worker task context as:

        current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;

i.e. GFP_NOIO context again. IOWs, all allocation in the IO submission path is explicitly GFP_NOIO to prevent any sort of reclaim recursion into filesystems or the block layer. That's the only sane thing to do, because multi-filesystem deadlocks are an utter PITA to triage and solve...

Keep in mind that PF_LOCAL_THROTTLE also prevents IO submission deadlocks in the lower filesystem. If the lower filesystem IO submission dirties pages (i.e. buffered writes) it can get throttled on the dirty page threshold. If it gets throttled like this trying to clean dirty pages from the upper filesystem we have a deadlock. The localio submission task will need to prevent that deadlock, too.

IOWs, just moving IO submission to another thread does not avoid the possibility of lower-to-upper filesystem recursion or lower filesystem dirty page throttling deadlocks.

> Note that there is a separate question concerning pageout to a swap file. pageout needs more than just deadlock avoidance. It needs guaranteed progress in low memory conditions. It needs PF_MEMALLOC (or mempools) and that cannot be finessed using work queues. I don't think that Linux is able to support pageout through layered filesystems.

I don't think we ever want to go there.

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx