Re: dm-writecache - Unexpected Data After Host Crash

On Fri, Jun 16, 2023 at 12:33 PM Lukas Straub <lukasstraub2@xxxxxx> wrote:
>
> On Wed, 14 Jun 2023 17:29:17 -0400
> Marc Smith <msmith626@xxxxxxxxx> wrote:
>
> > Hi,
> >
> > I'm using dm-writecache via 'lvmcache' on Linux 5.4.229 (vanilla
> > kernel.org source). I've been testing my storage server -- I'm using a
> > couple NVMe drives in an MD RAID1 array that is the cache (fast)
> > device, and using a 12-drive MD RAID6 array as the origin (backing)
> > device.
> >
> > I noticed that when the host crashes (power loss, forcefully reset,
> > etc.) it seems the cached (via dm-writecache) LVM logical volume does
> > not contain the bits I expect. Or perhaps I'm missing something in how
> > I understand/expect dm-writecache to function...
> >
> > I change the auto-commit settings to larger values so the data on the
> > cache device is not flushed to the origin device:
> > # lvchange --cachesettings "autocommit_blocks=1000000000000"
> > --cachesettings "autocommit_time=3600000" dev_1_default/sys_dev_01
> >
> > Then populate the start of the device (cached LV) with zeros:
> > # dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 oflag=direct
>
> Missing flush/fsync.
>
> > Force a flush from the cache device to the backing device (all zero's
> > in the first 10 MiB):
> > # dmsetup message dev_1_default-sys_dev_01 0 flush
> >
> > Now write a different pattern to the first 10 MiB:
> > # fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff
> > --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M
> > --output-format=terse --name=/dev/dev_1_default/sys_dev_01
>
> Again, no flush/fsync is issued.

I'm doing direct I/O so I wasn't anticipating the need for a flush/fsync.
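
If I re-run the test, I take it this is roughly what you mean -- a
sketch reusing the device names from above (conv=fsync is the GNU dd
spelling, end_fsync the fio option; my reading of the man pages, so
correct me if I'm off):

# dd if=/dev/zero of=/dev/dev_1_default/sys_dev_01 bs=1M count=10 \
    oflag=direct conv=fsync
# fio --bs=1m --direct=1 --rw=write --buffer_pattern=0xff \
    --ioengine=libaio --iodepth=1 --numjobs=1 --size=10M \
    --end_fsync=1 --filename=/dev/dev_1_default/sys_dev_01 \
    --name=flushed_write

i.e. the fsync at the end is what should reach dm-writecache as a
flush and make it commit the writes.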


>
> > And then induce a reset:
> > # echo b > /proc/sysrq-trigger
> >
> > Now after the system boots back up, assemble the RAID arrays and
> > activate the VG, then examine the data:
> > # vgchange -ay dev_1_default
> > # dd if=/dev/dev_1_default/sys_dev_01 bs=1M iflag=direct count=10
> > status=noxfer | od -t x2
> > 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> > *
> > 10+0 records in
> > 10+0 records out
> > 50000000
> >
> >
> > So I'm expecting all "ffff" in the first 10 MiB, but instead, I'm
> > getting what's on the origin device, zeros (not what was written to
> > the cache device).
> >
> > Obviously in a crash scenario (power loss, reset, panic, etc.) the
> > dirty data in the cache won't be flushed to the origin device,
> > however, I was expecting when the DM device started on the subsequent
> > boot (via activating the VG) that all of the dirty data would be
> > present -- it seems like it is not.
> >
> >
> > Thanks for any information/advice, it's greatly appreciated.
>
> This is the expected behavior. If you don't issue flushes, no guarantees
> are made about the durability of the newly written data.

Interesting... I wasn't expecting that. I guess I was thrown by the use
of persistent media (SSD / PMEM). If dm-writecache has dirty data that
isn't flushed to the origin device yet (no flush/fsync from the
application) and we lose power, the data is gone... why not just use
volatile RAM for the cache then?

I'm still experimenting and learning the code, but from what I've seen
so far, the dirty data blocks do reside on the SSD/PMEM device; it's
just the entry map in the metadata that isn't up-to-date after a
crash / power loss. I assume writing out all of the metadata on each
cache change would be very expensive in terms of I/O performance.
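
If I'm reading it right, forcing a commit by hand before the reset
should be enough for the new data to survive the crash -- a sketch,
same device names as before (the dmsetup status line is only there to
watch the writecache counters; its exact fields may differ between
kernel versions):

# dmsetup status dev_1_default-sys_dev_01
# dmsetup message dev_1_default-sys_dev_01 0 flush
# echo b > /proc/sysrq-trigger

and after reboot and vgchange -ay the first 10 MiB should read back as
ffff rather than zeros.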


>
> >
> > --Marc
> >
>

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel



