On Wed, Oct 03, 2018 at 06:45:13AM +0300, Amir Goldstein wrote:
> On Wed, Oct 3, 2018 at 2:14 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> [...]
> > > Seems like freezing any of the layers if overlay itself is not
> > > frozen is not a good idea.
> >
> > That's something we can't directly control. e.g. the lower
> > filesystem is on a DM volume. DM can freeze the lower filesystem
> > through the block device when a dm command is run. It may well be
> > that the admins who set up the storage and filesystem layers have
> > no idea that there are now overlay users on top of the filesystem
> > they originally set up. Indeed, the admins may not even know that
> > dm operations freeze filesystems, because it happens completely
> > transparently to them.
>
> I don't think we should be binding the stacked filesystem issues
> with the stacked block-over-fs issues.

It's the same problem. Hacking a one-off solution to hide a specific
overlay symptom does not address the root problem.

And, besides, if you stack like this:

	overlay
	lower_fs
	loopback dev
	loop img fs

and freeze the loop img fs, overlay can still get stuck in its
shrinker, because the lower_fs gets stuck doing IO on the frozen
loop img fs. i.e. it's the same issue - kswapd will get stuck doing
reclaim from the overlay shrinker.

> The latter is more complex to solve generally and has, by design,
> unlimited stack depth. The former has limited stack depth (2), and
> each sb knows its own stack depth, which is already used in overlay
> to annotate lockdep correctly.
>
> If vfs stores a reverse tree of stacked fs dependencies, then
> individual sb freeze can be solved.

Don't make me mention bind mounts... :/

> Drawing the fire away from overlayfs... I personally find the
> behavior that a process that only has files open for read could
> block when the filesystem is frozen somewhat unexpected to users
> (even if I can expect it).

Filesystem reads have always been able to modify the file (e.g.
atime updates). Not to mention that filesystem reads require memory
allocation, which means any GFP_KERNEL direct reclaim can get stuck
on a frozen filesystem if that filesystem hasn't properly cleared
out its dangerous reclaimable objects when freezing.

> I wonder out loud if it wouldn't be friendlier for any filesystem
> to defer "garbage collection" (e.g. truncating deleted inode
> blocks) to thawing time,

https://marc.info/?l=linux-xfs&m=153022904909523&w=2

That's been on the list of "nice to have" unlink optimisations for
XFS since 2008. But it's a performance optimisation and a precursor
to offlining AGs for online repair, not something we've ever
considered necessary for correctness or to prevent deadlocks.

> just as those operations are already run on mount (post crash)
> anyway.

That's a completely different context - log recovery is much more
constrained in the amount of work it needs to do and has much more
freedom in handling errors (i.e. it can just leak bad unlinked
inodes). Runtime deferral of post-unlink, post-reference inode
reclaim is a *lot* more complex than processing pending unlinks in
log recovery.
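
FWIW, to make the reclaim-vs-freeze interaction above concrete,
here's a minimal sketch of a per-sb shrinker whose scan callback has
to modify the filesystem to free its objects. All the demo_* names
are invented for illustration - this is not real overlayfs or XFS
code:

	/*
	 * Sketch only: a shrinker that must write to the filesystem
	 * (e.g. truncate unlinked inodes) to free memory. All demo_*
	 * names are hypothetical.
	 */
	#include <linux/fs.h>
	#include <linux/list.h>
	#include <linux/shrinker.h>
	#include <linux/spinlock.h>

	struct demo_sb_info {
		struct super_block	*sb;
		struct shrinker		shrinker;
		spinlock_t		lock;
		struct list_head	dispose;	/* unlinked inodes */
		unsigned long		nr_dispose;
	};

	static unsigned long demo_count(struct shrinker *s,
					struct shrink_control *sc)
	{
		struct demo_sb_info *sbi =
			container_of(s, struct demo_sb_info, shrinker);

		/* tell reclaim how many objects we could free */
		return sbi->nr_dispose;
	}

	static unsigned long demo_scan(struct shrinker *s,
				       struct shrink_control *sc)
	{
		struct demo_sb_info *sbi =
			container_of(s, struct demo_sb_info, shrinker);
		unsigned long freed = 0;

		/*
		 * Freeing these objects modifies the filesystem, so
		 * it must wait for a frozen fs to thaw. If we got
		 * here from kswapd or GFP_KERNEL direct reclaim
		 * while sbi->sb is frozen, reclaim is now stuck
		 * until the filesystem is thawed.
		 */
		sb_start_intwrite(sbi->sb);	/* blocks while frozen */
		/* ... truncate and free objects on sbi->dispose,
		 * counting them in freed ... */
		sb_end_intwrite(sbi->sb);

		return freed;
	}

	/*
	 * Wired up at mount time, something like:
	 *
	 *	sbi->shrinker.count_objects = demo_count;
	 *	sbi->shrinker.scan_objects = demo_scan;
	 *	sbi->shrinker.seeks = DEFAULT_SEEKS;
	 *	register_shrinker(&sbi->shrinker);
	 */

This is exactly why a filesystem has to clear out such reclaimable
objects as part of freezing - otherwise the first GFP_KERNEL
allocation that enters direct reclaim after the freeze can block on
that sb_start_intwrite() call.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx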