Re: [Question] why not flush device cache at _vg_commit_raw

On 24. 01. 24 at 12:58, Anthony Iliopoulos wrote:
On Tue, Jan 23, 2024 at 06:50:01PM +0100, Zdenek Kabelac wrote:
On 23. 01. 24 at 17:42, Demi Marie Obenour wrote:
On Mon, Jan 22, 2024 at 03:52:57PM +0100, Zdenek Kabelac wrote:
On 22. 01. 24 at 14:46, Anthony Iliopoulos wrote:
On Mon, Jan 22, 2024 at 01:48:41PM +0100, Zdenek Kabelac wrote:
On 22. 01. 24 at 12:22, Su Yue wrote:
Hi lvm folks,
      Recently we received a report about a device cache issue after vgchange --deltag.
What confuses me is that lvm never calls fsync on block devices, even at the end of the commit phase.

IIRC, it's common for userspace tools to call fsync (or use O_SYNC/O_DSYNC) while writing
critical data. Yes, lvm2 opens devices with O_DIRECT if they support it, but O_DIRECT doesn't
guarantee the data is persistent on storage when the write returns. The data can still be in the
device cache, and if a power failure happens in that window, critical metadata such as the VG
metadata could be lost.
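(For illustration, a minimal C sketch of that gap - this is not lvm2 code, and /dev/sdX is a hypothetical device: the O_DIRECT write bypasses the page cache, but only the fdatasync() afterwards asks the drive to flush its volatile cache.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdX";   /* hypothetical device */
    void *buf;
    int fd;

    fd = open(dev, O_RDWR | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires aligned buffers; 4096 covers common block sizes */
    if (posix_memalign(&buf, 4096, 4096)) {
        close(fd);
        return 1;
    }
    memset(buf, 0, 4096);

    /* the write bypasses the page cache, but the data may still sit in
     * the drive's volatile write-back cache when pwrite() returns */
    if (pwrite(fd, buf, 4096, 0) != 4096)
        perror("pwrite");

    /* fdatasync()/fsync() on a block device also issues a cache flush
     * to the drive, closing the power-loss window */
    if (fdatasync(fd) < 0)
        perror("fdatasync");

    free(buf);
    close(fd);
    return 0;
}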

Is there any particular reason not to flush data cache at VG commit time?


Hi

It seems the call to the 'dev_flush()' function somehow got lost during the
conversion to async aio usage - I'll investigate.

On the other hand, the chance of losing any data this way would be
really specific to some oddly behaving device.

There's no guarantee that data will be persisted to storage without
explicitly flushing the device data cache. Those are usually volatile
write-back caches, so the data aren't really protected against power
loss without fsyncing the blockdev.
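(As a side note - whether a disk reports such a volatile write-back cache can be checked from sysfs; a minimal sketch, with "sda" as a made-up device name:)

#include <stdio.h>

int main(void)
{
    /* reports "write back" (volatile cache, needs flushes) or "write through" */
    FILE *f = fopen("/sys/block/sda/queue/write_cache", "r");
    char mode[32] = "";

    if (f && fgets(mode, sizeof(mode), f))
        printf("cache mode: %s", mode);
    if (f)
        fclose(f);
    return 0;
}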

At a technical level, modern storage devices 'should' hold enough energy
internally to be able to flush all their caches to persistent storage in an
emergency. So unless we are dealing with some 'virtual' storage that may
fake various responses to IO handling - this should not be causing major
trouble.

This is only true for enterprise storage with power loss protection.
The vast majority of Qubes OS users use LVM with consumer storage, which
does not have power loss protection.  If this is unsafe, then Qubes OS
should switch to a different storage pool that flushes drive caches as
needed.

From the lvm2 perspective - the metadata is written first - then there is
usually a full flush of all I/O and a suspend of the actual device - if there
is any device already active on such a disk - so even if there were no
direct flush initiated by lvm2 itself, one is going to happen whenever
we update existing LVs.

Can you elaborate on that? Flushing IO does not imply flushing of the
device cache, but it is not clear what you mean by "suspend" here.

i.e. when you create a snapshot of an LV - the origin LV is suspended,
so this operation goes with a 'flush & fsfreeze' request.
Basically we skip these suspend flags only for 'device extension', where we
intentionally do not want to flush all data - but we now need to think through
some cases and how to properly submit fsync() for them.
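(Roughly, in terms of the public libdevmapper API - this is only a sketch with a made-up device name and trimmed error handling, not the lvm2 internal code path - the two suspend variants look like this:)

#include <libdevmapper.h>

static int suspend_dev(const char *name, int skip_flush)
{
    struct dm_task *dmt = dm_task_create(DM_DEVICE_SUSPEND);
    int r = 0;

    if (!dmt)
        return 0;
    if (!dm_task_set_name(dmt, name))
        goto out;

    if (skip_flush) {
        /* 'device extension' style: do not flush outstanding I/O
         * and do not freeze the filesystem on top of the LV */
        dm_task_no_flush(dmt);
        dm_task_skip_lockfs(dmt);
    }
    /* otherwise the suspend flushes queued I/O and freezes the fs,
     * which also pushes data out towards the device */

    r = dm_task_run(dmt);
out:
    dm_task_destroy(dmt);
    return r;
}

(Built with -ldevmapper; dm_task_no_flush()/dm_task_skip_lockfs() correspond to the skipped flush & fsfreeze mentioned above.)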

We may also need to extend this to some files maintained by lvm2,
where we would likely go with fdatasync().
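(A minimal sketch of what that could look like for a plain file - a hypothetical helper, not taken from lvm2: write the file, fdatasync() it, then fsync() the containing directory so the entry itself is persistent:)

#include <fcntl.h>
#include <unistd.h>

static int write_file_durably(const char *path, const char *dirpath,
                              const void *buf, size_t len)
{
    int fd, dfd, r;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len ||
        fdatasync(fd) < 0) {          /* flush file data to stable storage */
        close(fd);
        return -1;
    }
    close(fd);

    dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    r = fsync(dfd);                   /* persist the directory entry itself */
    close(dfd);
    return r;
}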

There is usually a stream of cache-flushing operations whenever e.g. a
thin-pool is synchronizing its metadata, or whenever any app running on the
device is synchronizing its data.

We cannot make any assumptions about what processes may be running and
whether they are actually doing fsync on the partition. Also, on devices that
support FUA, data-integrity operations are optimized by leveraging that,
and the global device cache flush is elided.

Note - it's not that we would want to depend on them. All I mean by this is that in practice the race window where the data remains only in the disk's cache is very small - that's also likely the reason why we have not spotted it yet.


In our case this came in because an LV tag manipulation wasn't properly
persisted in some HA failover scenario, but it definitely did not result in
actual data loss.

I'd be very interested in a more detailed description of this scenario - how it's been observed - and whether we can manage to write some simulation of it in our test suite, with monitoring via e.g. perf or something like that.

An alternative to fsync on the blockdev would be to open the device
with O_DSYNC or submit I/O with RWF_DSYNC, so that all writes are flushed
to the storage medium.
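(A minimal sketch of the RWF_DSYNC variant - a hypothetical helper, assuming Linux 4.7+ and a glibc that exposes pwritev2(); buffer and offset are only illustrative:)

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

static ssize_t dsync_write(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    /* RWF_DSYNC: this single write behaves as if the fd were opened with
     * O_DSYNC, so the data is on stable storage before the call returns */
    return pwritev2(fd, &iov, 1, off, RWF_DSYNC);
}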


I guess our dev_flush() function mostly handles all those cases properly
with the use of ioctl(BLKFLSBUF).
The only problem is - its usage somehow vanished - and even in the past it was
basically used only for non-direct I/O, so it was likely still not correct.
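(For reference, a minimal standalone sketch of the two flush paths being discussed - this is not lvm2's dev_flush(), and /dev/sdX is a made-up device: BLKFLSBUF flushes and drops the kernel's buffered data for the device, while fsync() on the block device fd also requests a flush of the drive's write cache:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKFLSBUF */
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdX", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (ioctl(fd, BLKFLSBUF, 0) < 0)   /* flush + invalidate the buffer cache */
        perror("BLKFLSBUF");
    if (fsync(fd) < 0)                 /* send a cache flush to the device */
        perror("fsync");

    close(fd);
    return 0;
}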

Regards

Zdenek




