On Fri, Feb 09, 2024 at 12:47:38PM +0000, John Garry wrote: > > > > Why should we jump through crazy hoops to try to make filesystems > > > > optimised for large IOs with mismatched, overlapping small atomic > > > > writes? > > > > > > As mentioned, typically the atomic writes will be the same size, but we may > > > have other writes of smaller size. > > > > Then we need the tiny write to allocate and zero according to the > > maximum sized atomic write bounds. Then we just don't care about > > large atomic IO overlapping small IO, because the extent on disk > > aligned to the large atomic IO is then always guaranteed to be the > > correct size and shape. > > I think it's worth mentioning that there is currently a separation between > how we configure the FS extent size for atomic writes and what the bdev can > actually support in terms of atomic writes. And that's part of what is causing all the issues here - we're trying to jump though hoops at the fs level to handle cases that they device doesn't support and vice versa. > Setting the rtvol extsize at mkfs time or enabling atomic writes > FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in > terms of atomic writes. Which is wrong. mkfs.xfs gets physical information about the volume from the kernel and makes the filesystem accounting to that information. That's how we do stripe alignment, sector sizing, etc. Atomic write support and setting up alignment constraints should be no different. Yes, mkfs allows the user to override the hardware configsi it probes, but it also warns when the override is doing something sub-optimal (like aligning all AG headers to the same disk in a stripe). IOWs, mkfs should be pulling this atomic write info from the hardware and configuring the filesysetm around that information. That's the target we should be aiming the kernel implementation at and optimising for - a filesystem that is correctly configured according to published hardware capability. Everything else is in the "make it behave correctly, but we don't care if it's slow" category. > This check is not done as it is not fixed what the bdev can do in terms of > atomic writes - or, more specifically, what they request_queue reports is > not be fixed. There are things which can change this. For example, a FW > update could change all the atomic write capabilities of a disk. Or even if > we swapped a SCSI disk into another host the atomic write limits may change, > as the atomic write unit max depends on the SCSI HBA DMA limits. Changing > BIO_MAX_VECS - which could come from a kernel update - could also change > what we report as atomic write limit in the request queue. If that sort of thing happens, then that's too bad. We already have these sorts of "do not do if you care about performance" constraints. e.g. don't do a RAID restripe that changes the alignment/size of the RAID device (e.g. add a single disk and make the stripe width wider) because the physical filesystem structure will no longer be aligned to the underlying hardware. instead, you have to grow striped volumes with compatible stripes in compatible sizes to ensure the filesystem remains aligned to the storage... We don't try to cater for every single possible permutation of storage hardware configurations - that way lies madness. Design and optimise for the common case of correctly configured and well behaved storage, and everything else we just don't care about beyond "don't corrupt or lose data". > > > > And therein lies the problem. > > > > > > > > If you are doing sub-rtextent IO at all, then you are forcing the > > > > filesystem down the path of explicitly using unwritten extents and > > > > requiring O_DSYNC direct IO to do journal flushes in IO completion > > > > context and then performance just goes down hill from them. > > > > > > > > The requirement for unwritten extents to track sub-rtextsize written > > > > regions is what you're trying to work around with XFS_BMAPI_ZERO so > > > > that atomic writes will always see "atomic write aligned" allocated > > > > regions. > > > > > > > > Do you see the problem here? You've explicitly told the filesystem > > > > that allocation is aligned to 64kB chunks, then because the > > > > filesystem block size is 4kB, it's allowed to track unwritten > > > > regions at 4kB boundaries. Then you do 4kB aligned file IO, which > > > > then changes unwritten extents at 4kB boundaries. Then you do a > > > > overlapping 16kB IO that*requires* 16kB allocation alignment, and > > > > things go BOOM. > > > > > > > > Yes, they should go BOOM. > > > > > > > > This is a horrible configuration - it is incomaptible with 16kB > > > > aligned and sized atomic IO. > > > > > > Just because the DB may do 16KB atomic writes most of the time should not > > > disallow it from any other form of writes. > > > > That's not what I said. I said the using sub-rtextsize atomic writes > > with single FSB unwritten extent tracking is horrible and > > incompatible with doing 16kB atomic writes. > > > > This setup will not work at all well with your patches and should go > > BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup > > has uncoordinated extent allocation and unwritten conversion > > granularity. > > > > That's the fundamental design problem with your approach - it allows > > unwritten conversion at *minimum IO sizes* and that does not work > > with atomic IOs with larger alignment requirements. > > > > The fundamental design principle is this: for maximally sized atomic > > writes to always succeed we require every allocation, zeroing and > > unwritten conversion operation to use alignments and sizes that are > > compatible with the maximum atomic write sizes being used. > > > > That sounds fine. > > My question then is how we determine this max atomic write size granularity. > > We don't explicitly tell the FS what atomic write size we want for a file. > Rather we mkfs with some extsize value which should match our atomic write > maximal value and then tell the FS we want to do atomic writes on a file, > and if this is accepted then we can query the atomic write min and max unit > size, and this would be [FS block size, min(bdev atomic write limit, > rtexsize)]. > > If rtextsize is 16KB, then we have a good idea that we want 16KB atomic > writes support. So then we could use rtextsize as this max atomic write > size. Maybe, but I think continuing to focus on this as 'atomic writes requires' is wrong. The filesystem does not carea bout atomic writes. What it cares about is the allocation constraints that need to be implemented. That constraint is that all BMBT extent operations need to be aligned to a specific extent size, not filesystem blocks. The current extent size hint (and rtextsize) only impact the -allocation of extents-. They are not directly placing constraints on the BMBT layout, they are placing constraints on the free space search that the allocator runs on the BNO/CNT btrees to select an extent that is then inserted into the BMBT. The problem is that unwritten extent conversion, truncate, hole punching, etc also all need to be correctly aligned for files that are configured to support atomic writes. These operations place constraints on how the BMBT can modify the existing extent list. These are different constraints to what rtextsize/extszhint apply, and that's the fundamental behavioural difference between existing extent size hint behaviour and the behaviour needed by atomic writes. > But I am not 100% sure that it your idea (apologies if I am wrong - I > am sincerely trying to follow your idea), but rather it would be > min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev > atomic write limit is 16KB, then there is no much point in dealing in 1MB > blocks for this unwritten extent conversion alignment. Exactly my point - there really is no relationship between rtextsize and atomic write constraints and that it is a mistake to use rtextsize as it stands as a placeholder for atomic write constraints. > If so, then my > concern is that the bdev atomic write upper limit is not fixed. This can > solved, but I would still like to be clear on this max atomic write size. Right, we haven't clearly defined how XFS is supposed align BMBT operations in a way that is compatible for atomic write operations. What the patchset does is try to extend and infer things from existing allocation alignment constraints, but then not apply those constraints to critical extent state operations (pure BMBT modifications) that atomic writes also need constrained to work correctly and efficiently. > > i.e. atomic writes need to use max write size granularity for all IO > > operations, not filesystem block granularity. > > > > And that also means things like rtextsize and extsize hints need to > > match these atomic write requirements, too.... > > As above, I am not 100% sure if you mean these to be the atomic write > maximal value. Yes, they either need to be the same as the atomic write max value or, much better, once a hint has been set, then resultant size is then aligned up to be compatible with the atomic write constraints. e.g. set an inode extent size hint of 960kB on a device with 128kB atomic write capability. If the inode has the atomic write flag set, then allocations need to round the extent size up from 960kB to 1MB so that the BMBT extent layout and alignment is compatible with 128kB atomic writes. At this point, zeroing, punching, unwritten extent conversion, etc also needs to be done on aligned 128kB ranges to be comaptible with atomic writes, rather than filesysetm block boundaries that would normally be used if just the extent size hint was set. This is effectively what the original "force align" inode flag bit did - it told the inode that all BMBT manipulations need to be extent size hint aligned, not just allocation. It's a generic flag that implements specific extent manipulation constraints that happen to be compatible with atomic writes when the right extent size hint is set. So at this point, I'm not sure that we should have an "atomic writes" flag in the inode. We need to tell BMBT modifications to behave in a particular way - forced alignment - not that atomic writes are being used in the filesystem.... At this point, the filesystem can do all the extent modification alignment stuff that atomic writes require without caring if the block device supports atomic writes or even if the application is using atomic writes. This means we can test the BMBT functionality in fstests without requiring hardware (or emulation) that supports atomic writes - all we need to do is set the forced align flag, an extent size hint and go run fsx on it... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx