On 14/01/2025 23:57, Darrick J. Wong wrote:
i.e. RWF_ATOMIC as implemented by a COW-capable filesystem should
always be able to succeed regardless of IO alignment. In these
situations, the REQ_ATOMIC block layer offload to the hardware is a
fast path that is enabled when the user IO and filesystem extent
alignment match the constraints needed to do a hardware atomic
write.
In all other cases, we implement RWF_ATOMIC as something like
always-cow or prealloc-beyond-EOF-then-exchange-range-on-IO-completion
for anything that doesn't correctly align to hardware REQ_ATOMIC.
That said, there is nothing that prevents us from first implementing
RWF_ATOMIC constraints as "must match hardware requirements exactly"
and then relaxing them to be less stringent as filesystem
implementations improve. We've relaxed the direct IO hardware
alignment constraints multiple times over the years, so there's
nothing that really prevents us from doing so with RWF_ATOMIC,
either. Especially as we have statx to tell the application exactly
what alignment will get fast hardware offloads...
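A rough, untested sketch of that statx query (assumes a Linux 6.11+
kernel and matching uapi headers so that STATX_WRITE_ATOMIC and the
stx_atomic_write_* fields are available; error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) < 0) {
		perror("statx");
		return 1;
	}

	/* STATX_ATTR_WRITE_ATOMIC says whether untorn writes are supported at all. */
	if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
		printf("untorn writes not supported\n");
		return 0;
	}

	printf("atomic write unit min: %u\n", stx.stx_atomic_write_unit_min);
	printf("atomic write unit max: %u\n", stx.stx_atomic_write_unit_max);
	printf("atomic write segments max: %u\n", stx.stx_atomic_write_segments_max);
	return 0;
}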
Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct
write that's correctly aligned and targets a single mapping in the
correct state, we can build the untorn bio and submit it. For
everything else, prealloc some post-EOF blocks, write the data there,
and exchange-range them.
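Just to illustrate the caller's side of that: the application issues a
direct IO pwritev2() with RWF_ATOMIC, and the filesystem decides
internally whether it becomes a single REQ_ATOMIC bio or goes through
the COW/exchange-range fallback. An untested sketch (the helper name is
just for illustration; assumes len is a power of two within the
statx-reported unit min/max, off is naturally aligned to len, and fd
was opened with O_DIRECT; RWF_ATOMIC is defined by hand in case the
libc headers don't carry it yet):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* from <linux/fs.h>, Linux 6.11+ */
#endif

/* One untorn write of 'len' bytes at 'off'; buf must also meet the
 * O_DIRECT memory alignment requirements (e.g. from posix_memalign()). */
static ssize_t atomic_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len = len,
	};

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}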
I have some doubt about this, but I may be misunderstanding the concept:
So is there any guarantee that the blocks we end up writing into are
aligned (after the exchange-range routine)? If not, surely every
subsequent RWF_ATOMIC write to that logical range will require this
exchange-range routine until we happen to get something aligned (and
of the correct granularity) - correct?
I know that continuously getting unaligned blocks is unlikely unless
the disk is heavily fragmented. However, databases prefer guaranteed
performance (which the HW offload gives).
We can use extszhint to hint at granularity, but that does not help with
alignment (AFAIK).
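For completeness, by extszhint I mean the extent size hint, i.e. what
xfs_io's "extsize" command sets via FS_IOC_FSSETXATTR. A rough sketch
(helper name just for illustration; extsize is in bytes and must be a
multiple of the fs block size, and the hint only affects future
allocations - which is the point: it says nothing about where those
extents land):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set an extent size hint on an open file; best done before any blocks
 * are allocated, since it only influences subsequent allocations. */
static int set_extsize_hint(int fd, unsigned int extsize)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = extsize;

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}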
Tricky questions: How do we avoid collisions between overlapping writes?
I guess we find a free file range at the top of the file that is long
enough to stage the write, and put it there? And purge it later?
Also, does this imply that the maximum file size is less than the usual
8EB?
(There's also the question about how to do this with buffered writes,
but I guess we could skip that for now.)