Re: [PATCH 17/21] fs: xfs: iomap atomic write support

Ming Lei <ming.lei@xxxxxxxxxx> · Tue, 5 Dec 2023 21:59:08 +0800

On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used for
> > > "atomic writes" in the field - see dubious amazon torn-write prevention.
> > 
> > What is the 'dubious amazon torn-write prevention'?
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
> 
> AFAICS, this is without any kernel changes, so no guarantee of unwanted
> splitting or merging of bios.
> 
> Anyway, there will still be !CoW FSes which people want to support.
> 
> > 
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then need to add some method to advertise that
> > > restriction.
> > 
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> > 
> > > Sure, but to me it is a concern that we have 2x paths to make robust a.
> > > offload via hw, which may involve CoW b. no HW support, i.e. CoW always
> > 
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware without has some other small limit like 32k
> > because it doesn't support multiple atomicy mode.
> 
> So what would you propose as the next step? Would it to be first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
> 
> > 
> > > And for no HW support, if we don't follow the O_ATOMIC model of committing
> > > nothing until a SYNC is issued, would we allocate, write, and later free a
> > > new extent for each write, right?
> > 
> > Yes. Then again if you do data journalling you do that anyway, and as
> > one little project I'm doing right now shows that data journling is
> > often the fastest thing we can do for very small writes.
> 
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?

Looks the HW support could be minimized, just like what Google and Amazon did,
16KB physical block size with proper queue limit setting.

Now seems it is easy to make such device with ublk-loop by:

- use one backing disk with 16KB/32KB/.. physical block size
- expose proper physical bs & chunk_sectors & max sectors queue limit

Then any 16KB aligned direct WRITE with N*16KB length(N in [1, 8] with 256
chunk_sectors) can be atomic-write.

Thanks,
Ming