Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support

John Garry <john.g.garry@xxxxxxxxxx> · Wed, 14 Feb 2024 11:06:11 +0000

Setting the rtvol extsize at mkfs time or enabling atomic writes
FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
terms of atomic writes.

Which is wrong. mkfs.xfs gets physical information about the volume
from the kernel and makes the filesystem accounting to that
information. That's how we do stripe alignment, sector sizing, etc.
Atomic write support and setting up alignment constraints should be
no different.

Yes, I was just looking at adding a mkfs option to format for atomic 
writes, which would check physical information about the volume and 
whether it suits rtextsize and then subsequently also set 
XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

Yes, mkfs allows the user to override the hardware configsi it
probes, but it also warns when the override is doing something
sub-optimal (like aligning all AG headers to the same disk in a
stripe).

IOWs, mkfs should be pulling this atomic write info from the
hardware and configuring the filesysetm around that information.
That's the target we should be aiming the kernel implementation at
and optimising for - a filesystem that is correctly configured
according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize 
set by the user is so much larger than what HW can support in terms of 
atomic writes, then we should let the user know about this.

Everything else is in the "make it behave correctly, but we don't
care if it's slow" category.

This check is not done as it is not fixed what the bdev can do in terms of
atomic writes - or, more specifically, what they request_queue reports is
not be fixed. There are things which can change this. For example, a FW
update could change all the atomic write capabilities of a disk. Or even if
we swapped a SCSI disk into another host the atomic write limits may change,
as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
BIO_MAX_VECS - which could come from a kernel update - could also change
what we report as atomic write limit in the request queue.

If that sort of thing happens, then that's too bad. We already have
these sorts of "do not do if you care about performance"
constraints. e.g. don't do a RAID restripe that changes the
alignment/size of the RAID device (e.g. add a single disk and make
the stripe width wider) because the physical filesystem structure
will no longer be aligned to the underlying hardware. instead, you
have to grow striped volumes with compatible stripes in compatible
sizes to ensure the filesystem remains aligned to the storage...

We don't try to cater for every single possible permutation of
storage hardware configurations - that way lies madness. Design and
optimise for the common case of correctly configured and well
behaved storage, and everything else we just don't care about beyond
"don't corrupt or lose data".

ok

And therein lies the problem.

...

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a file.
Rather we mkfs with some extsize value which should match our atomic write
maximal value and then tell the FS we want to do atomic writes on a file,
and if this is accepted then we can query the atomic write min and max unit
size, and this would be [FS block size, min(bdev atomic write limit,
rtexsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
writes support. So then we could use rtextsize as this max atomic write
size.

Maybe, but I think continuing to focus on this as 'atomic writes
requires' is wrong.

The filesystem does not carea bout atomic writes. What it cares
about is the allocation constraints that need to be implemented.
That constraint is that all BMBT extent operations need to be
aligned to a specific extent size, not filesystem blocks.

The current extent size hint (and rtextsize) only impact the
-allocation of extents-. They are not directly placing constraints
on the BMBT layout, they are placing constraints on the free space
search that the allocator runs on the BNO/CNT btrees to select an
extent that is then inserted into the BMBT.

The problem is that unwritten extent conversion, truncate, hole
punching, etc also all need to be correctly aligned for files that
are configured to support atomic writes. These operations place
constraints on how the BMBT can modify the existing extent list.

These are different constraints to what rtextsize/extszhint apply,
and that's the fundamental behavioural difference between existing
extent size hint behaviour and the behaviour needed by atomic
writes.

But I am not 100% sure that it your idea (apologies if I am wrong - I
am sincerely trying to follow your idea), but rather it would be
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
atomic write limit is 16KB, then there is no much point in dealing in 1MB
blocks for this unwritten extent conversion alignment.

Exactly my point - there really is no relationship between rtextsize
and atomic write constraints and that it is a mistake to use
rtextsize as it stands as a placeholder for atomic write
constraints.

ok

If so, then my
concern is that the bdev atomic write upper limit is not fixed. This can
solved, but I would still like to be clear on this max atomic write size.

Right, we haven't clearly defined how XFS is supposed align BMBT
operations in a way that is compatible for atomic write operations.

What the patchset does is try to extend and infer things from
existing allocation alignment constraints, but then not apply those
constraints to critical extent state operations (pure BMBT
modifications) that atomic writes also need constrained to work
correctly and efficiently.

Right. Previously I also did mention that we could explicitly request 
the atomic write size per-inode, but a drawback is that this would 
require an on-disk format change.

i.e. atomic writes need to use max write size granularity for all IO
operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to
match these atomic write requirements, too....

As above, I am not 100% sure if you mean these to be the atomic write
maximal value.

Yes, they either need to be the same as the atomic write max value
or, much better, once a hint has been set, then resultant size is
then aligned up to be compatible with the atomic write constraints.

e.g. set an inode extent size hint of 960kB on a device with 128kB
atomic write capability. If the inode has the atomic write flag set,
then allocations need to round the extent size up from 960kB to 1MB
so that the BMBT extent layout and alignment is compatible with 128kB
atomic writes.

At this point, zeroing, punching, unwritten extent conversion, etc
also needs to be done on aligned 128kB ranges to be comaptible with
atomic writes, rather than filesysetm block boundaries that would
normally be used if just the extent size hint was set.

This is effectively what the original "force align" inode flag bit
did - it told the inode that all BMBT manipulations need to be
extent size hint aligned, not just allocation. It's a generic flag
that implements specific extent manipulation constraints that happen
to be compatible with atomic writes when the right extent size hint
is set.

So at this point, I'm not sure that we should have an "atomic
writes" flag in the inode. 

Another motivation for this flag is that we can explicitly enable some 
software-based atomic write support for an inode when the backing device 
does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means 
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something 
similar for other OSes, and for those other OSes it may also mean some 
other special alignment feature enabled. We want a consistent user 
experience.

We need to tell BMBT modifications
to behave in a particular way - forced alignment - not that atomic
writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification
alignment stuff that atomic writes require without caring if the
block device supports atomic writes or even if the
application is using atomic writes.

This means we can test the BMBT functionality in fstests without
requiring hardware (or emulation) that supports atomic writes - all
we need to do is set the forced align flag, an extent size hint and
go run fsx on it...

The current idea was that the forcealign feature would be required for 
the second phase for atomic write support - non-rtvol support. Darrick 
did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first:
https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@xxxxxxxxxxxxx/

Right?

Thanks,
John