On Tue 03-09-24 08:44:16, Theodore Ts'o wrote:
> On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
> >
> > When setting GFP_NOFAIL, it's important to not only enable direct
> > reclaim but also the OOM killer. In scenarios where swap is off and
> > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> > result in an infinite loop. In other words, GFP_NOFAIL should not be
> > used with GFP_NOFS. Unfortunately, many call sites do combine them.
> > For example:
> >
> > XFS:
> >
> > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> >
> > EXT4:
> >
> > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> >
> > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> > could provide further insight.
>
> GFP_NOFS is needed because we need to signal to the mm layer to avoid
> recursing into file system layer --- for example, to clean a page by
> writing it back to the FS. Since we may have taken various file
> system locks, recursing could lead to deadlock, which would make the
> system (and the user) sad.
>
> If the mm layer wants to OOM kill a process, that should be fine as
> far as the file system is concerned --- this could reclaim anonymous
> pages that don't need to be written back, for example. And we don't
> need to write back dirty pages before the process killed. So I'm a
> bit puzzled why (as you imply; I haven't dug into the mm code in
> question) GFP_NOFS implies disabling the OOM killer?

Yes, because there might be a lot of fs pages pinned while performing a
NOFS allocation, and that could fire the OOM killer way too prematurely.
It has been quite some time since this was introduced, but I do remember
workloads hitting that. Also, there is usually kswapd making sufficient
progress to move forward.
There are cases where kswapd is completely stuck while other __GFP_FS
allocations trigger full direct reclaim or background kworkers free
some memory, and the OOM killer doesn't have a good enough picture to
make an educated guess that it is the only available way forward. A
typical example would be a workload that is thrashing but still making
slow progress, which is acceptable because the most important workload
makes decent progress (its working set fits in memory or is mlocked)
and rebuilding the state is more harmful than slow IO.
-- 
Michal Hocko
SUSE Labs