On 22/01/2025 23:51, Dave Chinner wrote:
I did my own quick PoC to use CoW as the fallback for misaligned atomic
writes.
I am finding that the block allocator often gives blocks which are
misaligned wrt the atomic write length.
Of course - I'm pretty sure this needs force-align to ensure that
the large allocated extent is aligned to file offset and hardware
atomic write alignment constraints....
Since we are not considering forcealign ATM, can we still give the block
allocator some other alignment hint? It could be similar to how stripe
alignment is handled.
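Purely as an aside, XFS already accepts a per-file allocation granularity
hint via the extent size hint ioctl. Below is an untested sketch of setting
one - the file path and the 64KiB size are just placeholders - only to show
the kind of allocation hint interface that already exists:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
    /* Placeholder path on an XFS mount. */
    int fd = open("/mnt/xfs/testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fsxattr fsx = { 0 };
    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSGETXATTR");
        return 1;
    }

    /* Hint that allocations for this file should be done in 64KiB units. */
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = 64 * 1024;

    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }
    return 0;
}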
Perhaps we should finish off the remaining bits needed to make
force-align work everywhere before going any further?
The forcealign implementation is just about finished from my PoV, but
needs more review.
However, there has been push back on that feature -
https://lore.kernel.org/linux-xfs/20240923120715.GA13585@xxxxxx/
So we will try this PoC for the unaligned software-emulated fallback,
see how it looks - especially in terms of performance - and go from there.
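For anyone wanting to exercise that fallback path from userspace, the
submission side is just pwritev2() with RWF_ATOMIC. An untested sketch
follows - the path, the 16KiB size and the 4KiB alignment are placeholders,
and it assumes headers new enough to define RWF_ATOMIC:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path on an XFS mount. */
    int fd = open("/mnt/xfs/testfile", O_RDWR | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* 16KiB buffer, aligned for O_DIRECT; sizes are placeholders. */
    void *buf;
    if (posix_memalign(&buf, 4096, 16384) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xab, 16384);

    struct iovec iov = { .iov_base = buf, .iov_len = 16384 };

    /* The whole 16KiB at offset 0 must land atomically or not at all. */
    if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
        perror("pwritev2(RWF_ATOMIC)");

    free(buf);
    close(fd);
    return 0;
}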
Some other thoughts:
- I am not sure what atomic write unit max we would now use.
What statx exposes should be the size/alignment for hardware offload
to take place (i.e. no change), regardless of what the filesystem
can do software offloads for. i.e. like statx->stx_blksize is the
"preferred block size for efficient IO", the atomic write unit
information is the "preferred atomic write size and alignment for
efficient IO", not the maximum sizes supported...
The user could get that from statx on the block device on which we are
mounted, if that is not too inconvenient.
It's already documented that an atomic write which exceeds the unit max
will be rejected. I don't like the idea of relaxing the API to "an
atomic write which exceeds unit max may be rejected". Indeed, in that case,
the user may still want to know the non-optimal unit max.
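FWIW, the query side of that is already there in statx. An untested sketch
of reading the advertised limits - the device path is a placeholder, and it
assumes headers new enough to define STATX_WRITE_ATOMIC and the
stx_atomic_write_* fields:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct statx stx;

    /* Placeholder path: the block device backing the mounted filesystem. */
    if (statx(AT_FDCWD, "/dev/sda", 0, STATX_WRITE_ATOMIC, &stx) < 0) {
        perror("statx");
        return 1;
    }

    if (!(stx.stx_mask & STATX_WRITE_ATOMIC)) {
        fprintf(stderr, "atomic write limits not reported\n");
        return 1;
    }

    printf("atomic write unit min: %u\n", stx.stx_atomic_write_unit_min);
    printf("atomic write unit max: %u\n", stx.stx_atomic_write_unit_max);
    printf("atomic write segments max: %u\n", stx.stx_atomic_write_segments_max);
    return 0;
}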
- Anything written back with CoW/exchange range will need FUA to ensure that
the write is fully persisted.
I don't think so. The journal commit for the exchange range
operation will issue a cache flush before the journal IO is
submitted. That will make the new data stable before the first
xchgrange transaction becomes stable.
Hence we get the correct data/metadata ordering on stable storage
simply by doing the exchange-range operation at data IO completion.
This is the same data/metadata ordering semantics that unwritten extent
conversion is based on....
I am not sure if we will use exchange range, but we need such behavior
(in terms of persistence) described for whatever we do.
Thanks,
John