On Tue, 18 Sep 2018, Eric Sandeen wrote:

> > is tight on disk space and doesn't care about performance).
>
> I think you may be conflating sector size with filesystem block size.
>
> ext4 makes no distinction between the two.
>
> XFS has both sector size (metadata atomic IO unit) and filesystem block
> size (file data allocation unit) as configurable mkfs-time options. The
> sector size can be smaller than, and up to, the filesystem block size.
>
> mkfs.xfs defaults to 4k filesystem blocks and device-physical-sector-sized
> sectors, i.e. the largest atomic IO the device advertises, because XFS
> metadata journaling relies on this IO atomicity. We allocate file data in
> 4k chunks, and do atomic metadata IO in device-sector-sized chunks.

You can have 512-byte metadata sectors and still read and write them in 4k
chunks (so that you avoid the read-modify-write logic in the SSD). If data
blocks are allocated on 4k boundaries, there's no risk of metadata-vs-data
buffer races.

> ext4 doesn't - it's true - but I cannot help but believe that ext4
> occasionally gets harmed by this choice, because it's absolutely possible
> that a 4k metadata write gets only partly persisted if power fails on a
> 512/512 disk, for example. In practice it seems to generally work out ok,
> but it is going beyond what the device says it can guarantee.
>
> -Eric

I implemented the journal in the dm-integrity driver, and I solved this
problem of partial writes by tagging every 512-byte journal sector with an
8-byte tag. If the tags don't match, there was a power failure during the
write, and the partially written journal section will not be replayed. The
journal is written using 4k-aligned writes because they perform better.

ext4 solves this problem by using checksums.

Mikulas
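The tagging scheme described above can be sketched roughly as follows. This is a simplified model, not the actual dm-integrity code: it assumes each 512-byte journal sector ends with the same 8-byte commit tag, so a 4k write torn by power failure leaves at least one sector with a stale tag, and replay can detect and skip that section. The names and layout here are illustrative, not the driver's real on-disk format.

```python
import struct

SECTOR = 512
TAG_SIZE = 8
SECTORS_PER_WRITE = 8            # 8 x 512 bytes = one 4k-aligned journal write
PAYLOAD = SECTOR - TAG_SIZE      # journal data carried per sector

def pack_section(payload: bytes, commit_id: int) -> bytes:
    """Split the payload across sectors, appending the same 8-byte commit
    tag to every 512-byte sector of the 4k journal section."""
    assert len(payload) == PAYLOAD * SECTORS_PER_WRITE
    tag = struct.pack("<Q", commit_id)
    return b"".join(
        payload[i * PAYLOAD:(i + 1) * PAYLOAD] + tag
        for i in range(SECTORS_PER_WRITE)
    )

def section_is_complete(raw: bytes, expected_id: int) -> bool:
    """Replay check: the section is valid only if every sector carries the
    expected tag. A torn write leaves old data (and an old tag) in the
    sectors that never reached the media, so the tags won't all match."""
    tag = struct.pack("<Q", expected_id)
    return all(
        raw[i * SECTOR + PAYLOAD:(i + 1) * SECTOR] == tag
        for i in range(SECTORS_PER_WRITE)
    )
```

For example, if only the first few 512-byte sectors of a 4k write were persisted before power failed, `section_is_complete` returns `False` and replay skips the section, mirroring the behavior described above.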