On Tue, 18 Sep 2018, Eric Sandeen wrote:

> > is tight on disk space and doesn't care about performance).
>
> I think you may be conflating sector size with filesystem block size.
>
> ext4 makes no distinction between the two.
>
> XFS has both sector size (metadata atomic IO unit) and filesystem block
> size (file data allocation unit) as configurable mkfs-time options. The
> sector size can be smaller than, and up to, the filesystem block size.
>
> mkfs.xfs defaults to 4k filesystem blocks and device-physical-sector-sized
> sectors, i.e. the largest atomic IO the device advertises, because XFS
> metadata journaling relies on this IO atomicity. We allocate file data in
> 4k chunks, and do atomic metadata IO in device-sector-sized chunks.

You can have 512-byte metadata sectors and still read and write them in 4k
chunks (so that you avoid the read-modify-write logic in the SSD). If data
blocks are allocated on 4k boundaries, there's no risk of metadata-vs-data
buffer races.

> ext4 doesn't - it's true - but I cannot help but believe that ext4
> occasionally gets harmed by this choice, because it's absolutely possible
> that a 4k metadata write gets only partly persisted if power fails on a
> 512/512 disk, for example. In practice it seems to generally work out ok,
> but it is going beyond what the device says it can guarantee.
>
> -Eric

I implemented the journal in the dm-integrity driver, and I solved this
problem of partial writes by tagging every 512-byte journal sector with an
8-byte tag. If the tags don't match, there was a power failure during the
write, and the partially written journal section will not be replayed. The
journal is written using 4k-aligned writes because they perform better.

ext4 solves this problem by using checksums.

Mikulas
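The tagging scheme described above can be sketched roughly as follows. This is a simplified model, not the actual dm-integrity code: it assumes each 512-byte journal sector ends with the same 8-byte commit tag, so a 4k write torn by power failure leaves at least one sector with a stale tag, and replay can detect and skip that section. The names and layout here are illustrative, not the driver's real on-disk format.

```python
import struct

SECTOR = 512
TAG_SIZE = 8
SECTORS_PER_WRITE = 8            # 8 x 512 bytes = one 4k-aligned journal write
PAYLOAD = SECTOR - TAG_SIZE      # journal data carried per sector

def pack_section(payload: bytes, commit_id: int) -> bytes:
    """Split the payload across sectors, appending the same 8-byte commit
    tag to every 512-byte sector of the 4k journal section."""
    assert len(payload) == PAYLOAD * SECTORS_PER_WRITE
    tag = struct.pack("<Q", commit_id)
    return b"".join(
        payload[i * PAYLOAD:(i + 1) * PAYLOAD] + tag
        for i in range(SECTORS_PER_WRITE)
    )

def section_is_complete(raw: bytes, expected_id: int) -> bool:
    """Replay check: the section is valid only if every sector carries the
    expected tag. A torn write leaves old data (and an old tag) in the
    sectors that never reached the media, so the tags won't all match."""
    tag = struct.pack("<Q", expected_id)
    return all(
        raw[i * SECTOR + PAYLOAD:(i + 1) * SECTOR] == tag
        for i in range(SECTORS_PER_WRITE)
    )
```

For example, if only the first few 512-byte sectors of a 4k write were persisted before power failed, `section_is_complete` returns `False` and replay skips the section, mirroring the behavior described above.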