On Wed, Mar 20, 2024 at 08:38:52AM -0400, Phillip Susi wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>
> > That's what I expected - I would have been surprised if you found
> > problems across multiple filesystems...
>
> How do the other filesystems know they don't need to issue a flush?
> While this particular method of reproducing the problem ( sync without
> touching the filesystem ) only shows on ext4, I'm not sure this isn't
> still a broader problem.

It may well be a broader problem, but it's a filesystem
implementation issue and not a generic VFS issue.

Unfortunately, without knowing a lot about storage stacks and
filesystem implementations, it's hard to understand why this is the
case. I'll use XFS as an example of how a filesystem can know if it
needs to issue cache flushes or not on sync.

> Say that a program writes some data to a file. Due to cache pressure,
> the dirty pages get written to the disk.

Now the filesystem is idle, with no dirty data or metadata.

In the case of XFS, this will begin the process of "covering the
log". This takes 60-90s (3 consecutive log sync worker executions),
and it involves the journal updating and logging the superblock and
writing it back to mark the journal as empty.

These log writes are integrity writes (REQ_PREFLUSH|REQ_FUA) and so
issuing a log write guarantees all data written and completed will
be stable on disk before the log write is -submitted-. This is
guaranteed via the pre-submission cache flush (REQ_PREFLUSH) that
provides completion-to-submission IO ordering via pre-flush
semantics.

The log write itself is guaranteed to be stable on disk before it
completes (REQ_FUA), and so when the journal writes complete, all
data and metadata is guaranteed to be on stable storage.

So while this covering process takes up to 90s after the last change
in memory has been written to disk, after the first 30s of idle
time, XFS has already issued cache flushes to ensure all data and
metadata is stable on disk.
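
As a rough illustration of the ordering described above, here is a
toy Python model of a device with a volatile write cache and a
filesystem that tracks whether it is clean on stable storage. This
is not XFS or block layer code: ToyDisk, ToyFS, their methods, and
the flag values are all made up for the sketch, and only the
REQ_PREFLUSH/REQ_FUA names come from the text above.

```python
from dataclasses import dataclass, field

# Toy flag values for this sketch only; not the kernel's definitions.
REQ_PREFLUSH = 1 << 0
REQ_FUA = 1 << 1


@dataclass
class ToyDisk:
    cache: dict = field(default_factory=dict)   # volatile write cache
    stable: dict = field(default_factory=dict)  # persistent media
    flushes: int = 0                            # cache flushes performed

    def flush(self):
        """Move everything in the volatile cache onto stable media."""
        self.stable.update(self.cache)
        self.cache.clear()
        self.flushes += 1

    def write(self, block, data, flags=0):
        if flags & REQ_PREFLUSH:
            # Previously completed writes become stable before this one starts.
            self.flush()
        if flags & REQ_FUA:
            # This write is on stable media before it completes.
            self.cache.pop(block, None)
            self.stable[block] = data
        else:
            # An ordinary write may complete while still only in the cache.
            self.cache[block] = data

    def power_fail(self):
        """Lose whatever was only in the volatile cache."""
        self.cache.clear()


@dataclass
class ToyFS:
    disk: ToyDisk
    clean_on_stable: bool = True  # what the filesystem knows about the stack below

    def writeback(self, block, data):
        """Ordinary data writeback: completes into the volatile cache."""
        self.disk.write(block, data)
        self.clean_on_stable = False

    def log_write(self, record):
        """Integrity write, as a journal would issue it."""
        self.disk.write("log", record, REQ_PREFLUSH | REQ_FUA)
        self.clean_on_stable = True

    def sync(self):
        """Only touch the device if something might still be volatile."""
        if not self.clean_on_stable:
            self.log_write("sync")


if __name__ == "__main__":
    disk = ToyDisk()
    fs = ToyFS(disk)
    fs.writeback("file-block-0", "data")
    fs.log_write("cover")            # stands in for the idle-time log covering
    flushes_before = disk.flushes
    fs.sync()                        # already clean, so no flush is issued
    disk.power_fail()
    print("flush issued by sync:", disk.flushes != flushes_before)
    print("stable contents:", disk.stable)
```

The point of the sketch is only that the filesystem, not the device,
carries the "am I clean on stable storage?" state, which is why a
later sync can be a no-op.
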
Once those cache flushes have completed, the device can be safely
powered down without concern.

Put simply: for general purpose filesystems, it's considered a bug
to leave data and/or metadata in volatile caches indefinitely,
because that guarantees data loss will occur on crash and/or power
failure...

> Some time later, the disk is
> runtime suspended ( which flushes its write cache ).

Which is a no-op on devices with XFS filesystems on them, because
the cache should already be clean.

> After that,
> someone does some kind of sync ( whole fs or individual file ). Doesn't
> the FS *have* to issue a flush at that point?

No, because the filesystem often already knows that it is
completely clean on stable storage. Hence we don't need to do
anything when a sync is run, not even a cache flush...

> Even though there is
> nothing in the disk's cache, the FS doesn't know that.

On the contrary: filesystems need to know if they are clean all the
way down to stable storage - the filesystem layer is what provides
the guarantees for user data integrity, so they *must* understand
and control the volatile caches below them in the storage stack
correctly.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx