On 22/01/2025 06:42, Christoph Hellwig wrote:
> On Fri, Jan 17, 2025 at 10:49:34AM -0800, Darrick J. Wong wrote:
>> The trouble is that the br_startoff attribute of cow staging mappings
>> isn't persisted on disk anywhere, which is why exchange-range can't
>> handle the cow fork. You could open an O_TMPFILE and swap between the
>> two files, though that gets expensive per-io unless you're willing to
>> stash that temp file somewhere.
>
> Needing another inode is better than trying to steal ranges from the
> actual inode we're operating on. But we might just need a different
> kind of COW staging for that.
>
>> At this point I think we should slap the usual EXPERIMENTAL warning on
>> atomic writes through xfs and let John land the simplest multi-fsblock
>> untorn write support, which only handles the corner case where all the
>> stars are <cough> aligned; and then make an exchange-range prototype
>> and/or all the other forcealign stuff.
>
> That is the worst of all possible outcomes. Coming up with an
> atomic API that fails for random reasons only on aged file systems
> is literally the worst thing we can do. NAK.
I did my own quick PoC that uses CoW as a fallback for atomic writes to
misaligned blocks.
I am finding that the block allocator often hands out blocks that are
misaligned wrt the atomic write length, like this:
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..20479]:      192..20671        0 (192..20671)       20480 000000

# xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..127]:        20672..20799      0 (20672..20799)       128 000000
   1: [128..20479]:    320..20671        0 (320..20671)       20352 000000

# xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
# xfs_bmap -v mnt/file
mnt/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..127]:        20928..21055      0 (20928..21055)       128 000000
   1: [128..20479]:    320..20671        0 (320..20671)       20352 000000
In this case we would not use HW offload (as no start blocks are
64K-aligned), which will affect performance.
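To make the misalignment concrete, here is a minimal sketch that checks the start of each 64K extent from the listings above. It relies on xfs_bmap reporting ranges in 512-byte units; the helper name misalignment() is just something I made up for illustration.

```python
SECTOR_SIZE = 512          # xfs_bmap reports ranges in 512-byte units
ATOMIC_LEN = 64 * 1024     # atomic write length used in the xfs_io commands


def misalignment(start_sector: int, atomic_len: int) -> int:
    """Return how many bytes the extent start lies past the previous
    atomic_len boundary (0 means the start is aligned)."""
    return (start_sector * SECTOR_SIZE) % atomic_len


# Start sectors of the freshly allocated 64K extents in the two listings.
for start in (20672, 20928):
    off = misalignment(start, ATOMIC_LEN)
    print(f"start sector {start}: {off} bytes past a 64K boundary")
```

Both starts come out 32K past a 64K boundary, so neither write could be issued as a single aligned 64K HW atomic write.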
Since we are not considering forcealign ATM, can we still consider some
other alignment hint to the block allocator? It could be similar to how
stripe alignment is handled.
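Roughly what I have in mind for such a hint, as a sketch only: given a free extent, prefer the first start that is a multiple of the alignment, if the request still fits. This is not how the XFS allocator is structured; aligned_start() and its parameters are hypothetical, and the 4K block size in the example (so 64K = 16 blocks) is an assumption.

```python
from typing import Optional


def aligned_start(free_start: int, free_len: int,
                  want_len: int, align: int) -> Optional[int]:
    """Return the first align-aligned start block inside the free extent
    [free_start, free_start + free_len) that still fits want_len blocks,
    or None if no such start exists. All values are in filesystem blocks."""
    start = -(-free_start // align) * align   # round free_start up to align
    if start + want_len <= free_start + free_len:
        return start
    return None


# Assuming 4K blocks, sector 20672 is block 2584, which is 8 blocks past
# a 16-block (64K) boundary; the hint would skip ahead to block 2592.
print(aligned_start(free_start=2584, free_len=64, want_len=16, align=16))
```

When the free extent is too short to hold an aligned allocation, the helper returns None and the caller would fall back to an unaligned allocation, which is what makes it a hint rather than a guarantee like forcealign.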
Some other thoughts:
- I am not sure what atomic write unit max we would now use.
- Anything written back with CoW/exchange range will need FUA to ensure
that the write is fully persisted. Without FUA, I think the disk could
report the data as written while only part of it survives a later power
failure.