Re: Uneccesary flushes waking up suspended disks

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 21 Mar 2024 08:58:16 +1100

On Wed, Mar 20, 2024 at 08:38:52AM -0400, Phillip Susi wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> 
> > That's what I expected - I would have been surprised if you found
> > problems across multiple filesystems...
> 
> How do the other filesystems know they don't need to issue a flush?
> While this particular method of reproducing the problem ( sync without
> touching the filesystem ) only shows on ext4, I'm not sure this isn't
> still a broader problem.

It may well be a broader problem, but it's a filesystem
implementation issue and not a generic VFS issue. Unfortunately,
without knowing a lot about storage stacks and filesystem
implementations, it's hard to understand why this is the case.
I'll use XFS as an example of how a filesystem can know if it
needs to issue cache flushes or not on sync.

> Say that a program writes some data to a file.  Due to cache pressure,
> the dirty pages get written to the disk.

Now the filesystem is idle, with no dirty data or metadata.

In the case of XFS, this will begin the process of "covering the
log". This takes 60-90s (3 consecutive log sync worker executions),
and it involves the journal updating and logging the superblock and
writing it back to mark the journal as empty.

These log writes are integrity writes (REQ_PREFLUSH|REQ_FUA) and so
issuing a log write guarantees that all data written and completed
will be stable on disk before the log write is -submitted-. This is
guaranteed via the pre-submission cache flush (REQ_PREFLUSH) that
provides completion-to-submission IO ordering via pre-flush
semantics. The log write itself is guaranteed to be stable on disk
before it completes (REQ_FUA), and so when the journal writes
complete, all data and metadata is guaranteed to be on stable
storage.
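
As a rough illustration of that ordering, here is a toy Python model of
a device with a volatile write cache. It is a sketch only, not kernel
code: the Disk class, its methods and the block names are hypothetical,
and the preflush/fua arguments merely mimic what REQ_PREFLUSH and
REQ_FUA are described as doing above.

```python
class Disk:
    """Toy device with a volatile write cache (hypothetical, not kernel code)."""

    def __init__(self):
        self.cache = {}    # completed writes that are not yet on stable media
        self.stable = {}   # contents that survive a power loss
        self.flushes = 0   # number of cache flushes the device has performed

    def flush(self):
        # Move everything in the volatile cache to stable media.
        self.flushes += 1
        self.stable.update(self.cache)
        self.cache.clear()

    def write(self, block, data, preflush=False, fua=False):
        if preflush:
            # Mimics REQ_PREFLUSH: flush previously completed writes
            # before this write is processed.
            self.flush()
        if fua:
            # Mimics REQ_FUA: this write is on stable media when it completes.
            self.cache.pop(block, None)
            self.stable[block] = data
        else:
            # Ordinary write: may sit in the volatile cache indefinitely.
            self.cache[block] = data

    def power_loss(self):
        # Anything still in the volatile cache is lost.
        self.cache.clear()


disk = Disk()
disk.write("data0", "file contents")                         # ordinary data write
disk.write("log0", "commit record", preflush=True, fua=True)  # integrity write
disk.power_loss()
print(sorted(disk.stable))
```

In this model both the earlier data write and the log write survive the
simulated power loss, because the pre-flush pushed the data out before
the log write and the log write itself bypassed the volatile cache.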

So while this covering process takes up to 90s after the last change
in memory has been written to disk, after the first 30s of idle
time, XFS has already issued cache flushes to ensure all data and
metadata is stable on disk.  The device can be safely powered down
at that time without concern.
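
The timing above can be worked through with a small Python sketch. It
assumes the log sync worker runs on a fixed 30s period and that covering
completes on the third consecutive run after the filesystem goes idle;
the function and constant names are hypothetical and only serve the
arithmetic.

```python
LOG_WORKER_PERIOD_S = 30   # assumed period of the log sync worker
COVER_PASSES = 3           # consecutive worker runs needed to cover the log


def covering_delays(idle_start_s):
    """Return (seconds until first flush, seconds until log is covered).

    idle_start_s is the time at which the last change was written to disk,
    measured on the same clock as the worker, which is assumed to run at
    multiples of LOG_WORKER_PERIOD_S.
    """
    # First worker run strictly after the filesystem became idle.
    first_run = (idle_start_s // LOG_WORKER_PERIOD_S + 1) * LOG_WORKER_PERIOD_S
    # Covering finishes on the third consecutive run.
    covered_at = first_run + (COVER_PASSES - 1) * LOG_WORKER_PERIOD_S
    return first_run - idle_start_s, covered_at - idle_start_s


for t in (0, 29, 45):
    print(t, covering_delays(t))
```

Under these assumptions the first cache flush always lands within 30s of
going idle, while full covering takes between 60s and 90s depending on
where in the worker period the last write completed.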

Put simply: for general purpose filesystems, it's considered a bug
to leave data and/or metadata in volatile caches indefinitely,
because that guarantees data loss will occur on crash and/or power
failure...

> Some time later, the disk is
> runtime suspended ( which flushes its write cache ).

Which is a no-op on devices with XFS filesystems on them, because
the cache should already be clean.

> After that,
> someone does some kind of sync ( whole fs or individual file ).  Doesn't
> the FS *have* to issue a flush at that point?

No, because the filesystem often already knows that it is completely
clean on stable storage. Hence we don't need to do anything when a
sync is run, not even a cache flush...
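
A minimal sketch of that idea in Python, assuming the filesystem simply
tracks whether anything has completed since its last integrity write.
This is not how XFS is actually implemented; ToyFS and its methods are
hypothetical names used only to show why sync can be a no-op.

```python
class ToyFS:
    """Hypothetical filesystem that tracks whether the device cache may be dirty."""

    def __init__(self):
        self.needs_flush = False   # True if a write completed since the last flush
        self.flushes_issued = 0    # cache flushes sent to the device

    def write_completed(self):
        # Data or metadata reached the device, possibly only its volatile cache.
        self.needs_flush = True

    def log_write(self):
        # Integrity write: flushes the device cache as a side effect.
        self.flushes_issued += 1
        self.needs_flush = False

    def sync(self):
        """Return True if sync had to issue a flush, False if it was a no-op."""
        if not self.needs_flush:
            return False           # already clean on stable storage
        self.log_write()
        return True


fs = ToyFS()
fs.write_completed()   # writeback pushes dirty pages to the device
fs.log_write()         # periodic log worker issues an integrity write
print(fs.sync())       # idle filesystem: no flush needed
fs.write_completed()
print(fs.sync())       # new write since the last flush: sync must flush
```

In this sketch the first sync touches nothing on the device, so a
suspended disk would not be woken; only the second sync, which follows a
new write, issues a flush.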

> Even though there is
> nothing in the disk's cache, the FS doesn't know that.

On the contrary: filesystems need to know if they are clean all the
way down to stable storage - the filesystem layer is what provides
the guarantees for user data integrity, so filesystems *must*
understand and control the volatile caches below them in the storage
stack correctly.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx