[please wrap your emails at 72 columns]

On Mon, Dec 02, 2024 at 10:41:02AM +0000, Mitta Sai Chaithanya wrote:
> Hi Team,
>          We are using XFS reflink feature to create snapshot of an
> origin file (a thick file, created through fallocate) and exposing
> origin file as a block device to the users.

So it's basically a loop block device?

All the questions you are asking are answered by studying how
drivers/block/loop.c translates block device integrity requests to
VFS operations on the backing file.

The block device API has a mechanism for triggering integrity
operations: REQ_PREFLUSH to flush volatile caches, and REQ_FUA to ask
for a specific IO to be persisted to stable storage. The loop device
translates REQ_PREFLUSH to vfs_fsync() on the backing file, and
REQ_FUA is emulated by write/vfs_fsync_range().

I have a patch to natively support REQ_FUA by converting it to a
RWF_DSYNC write call, which allows the underlying filesystem to
convert that data integrity write to a REQ_FUA write w/ O_DIRECT...

> XFS file was opened with O_DIRECT option to avoid buffers at lower
> layer while performing writes, even though a thick file is created,
> when user performs writes then there are metadata changes associated
> to writes (mostly xfs marks extents to know whether the data is
> written to physical blocks or not).

This is not specific to O_DIRECT - even buffered writes need to do
post data-IO unwritten extent conversion on the first write to any
file data.

> To avoid metadata changes during user writes we are explicitly
> zeroing entire file range post creation of file, so that there won't
> be any metadata changes in future for writes that happen on zeroed
> blocks.

Which makes fallocate() redundant. Simply writing zeroes to an empty
file will allocate and mark the extents as written. Run fsync at the
end of the writes, and then there are no metadata updates other than
timestamps for future overwrites.

Until, of course ....

> Now, if reflink copy of origin file is created then there will be
> metadata changes which need to be persisted if data is overwritten
> on the reflinked blocks of original file.

.... you share the data extents between multiple inodes. Then every
data write that needs to break extent sharing will trigger a COW that
allocates new extents, hence requiring metadata modification both
before the data IO is submitted and again after it is completed.

> Even though the file is opened in O_DIRECT mode changes to metadata
> do not persist before write is acknowledged back to user,

O_DIRECT by itself does not imply -any- data integrity nor any
specific data/metadata ordering. Filesystems and block devices are
free to treat O_DIRECT writes in any way they want w.r.t. caching,
(lack of) crash resilience, etc.

> if system crashes when changes are in buffer then post recovery
> writes which were acknowledged are not available to read.

Well, yes. You need to combine O_DIRECT with O_DSYNC/O_SYNC/f[data]sync
for it to have any meaning for the persistence of the data and
metadata needed to retrieve the data being written.

> Two options that we are aware of to avoid the consistency issue are:
>
> 1. Adding O_SYNC flag while opening file which ensures each write
> gets persisted in persistent media, but this leads to poor
> performance.

"Poor performance" == "exact capability of the storage hardware to
persist data".

i.e. the performance of O_DIRECT writes with a data integrity
requirement is directly determined by the speed with which the
storage device can persist the data.
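To make that concrete, here is a minimal userspace sketch of a per-IO
data integrity write. The file name, write offset and 4096 byte
alignment are assumptions for illustration only, not taken from your
setup:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd;

        /*
         * O_DSYNC is what turns an O_DIRECT write into a data
         * integrity write; O_DIRECT alone gives no persistence or
         * ordering guarantees. "backing-file" and the 4096 byte
         * alignment are assumptions for illustration.
         */
        fd = open("backing-file", O_WRONLY | O_DIRECT | O_DSYNC);
        if (fd < 0)
                err(1, "open");

        /* O_DIRECT needs the buffer, length and offset aligned to
         * the logical block size (4096 assumed here). */
        if (posix_memalign(&buf, 4096, 4096))
                err(1, "posix_memalign");
        memset(buf, 0, 4096);

        /*
         * When this returns, the data and the metadata needed to
         * retrieve it are on stable storage - completion latency is
         * whatever the device needs to persist the IO(s).
         */
        if (pwrite(fd, buf, 4096, 0) != 4096)
                err(1, "pwrite");

        free(buf);
        close(fd);
        return 0;
}

Every such write pays the device's full persistence latency before it
returns, which is exactly the "poor performance" you are measuring.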
A filesystem like XFS will require two IOs to persist data written to
a newly allocated extent, and they are dependent writes. We have no
mechanism for telling block devices that they must order writes to
persistent storage in a specific manner, so our only tool for
ensuring that the block device orders the data and metadata IOs
correctly is to issue a REQ_PREFLUSH between the two IOs. We do this
with the journal IO, as it is the IO that requires all data writes to
be persistent before we persist the metadata in the journal....

If you can use AIO/io_uring, then the latency of these dependent
writes can be hidden as the process does not block waiting for two
IOs to complete. It can process more IO submissions whilst the data
integrity write is in flight. Then performance is not limited by
synchronous data integrity IO latency....

> 2. Performing sync operation along with writes/post writes will
> guarantee that metadata changes will be persisted.

Yes, but that will only result in faster IO if the fdatasync calls
are batched for multiple data IOs.

i.e. for O_DIRECT, fdatasync() is effectively a REQ_PREFLUSH|REQ_FUA
journal write. If you are issuing one fdatasync per write, there is
no benefit over O_DSYNC. And if you can batch O_DIRECT writes per
fdatasync() call, then the block device you expose effectively has a
volatile cache, and the REQ_PREFLUSH and REQ_FUA operations issued by
the upper block device need to be obeyed.

IOWs, if you can amortise fdatasync across multiple IOs, then you may
as well just advertise the device as having volatile caches and
simply rely on the upper filesystem (i.e. whatever is on the block
device) to issue data integrity flushes as appropriate....

i.e. this is the model the block loop device implements.

> Are there any other options available to avoid the above consistency
> issue (without much degradation in performance)?

There is little anyone can do to reduce the latency of individual IO
completion to stable storage - single threaded, synchronous data
integrity IO is always going to have a significant completion latency
penalty. To mitigate this the storage stack and/or applications need
to be architected to work in a way that isn't directly IO latency
sensitive.

As I said at the start, study the block loop device architecture and
pay attention to how it implements REQ_PREFLUSH and the AIO+DIO
backing file IO submission and completion signalling.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
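P.S. a minimal sketch of what "amortise fdatasync across multiple
IOs" means at the syscall level - the file name, 4096 byte block size
and batch of 16 writes are assumptions for illustration only:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   4096            /* assumed logical block size */
#define NWRITES 16              /* assumed batch size */

int main(void)
{
        void *buf;
        off_t off;
        int fd, i;

        /* No O_DSYNC: the individual writes are not data integrity
         * writes, so they complete at device write latency. */
        fd = open("backing-file", O_WRONLY | O_DIRECT);
        if (fd < 0)
                err(1, "open");

        if (posix_memalign(&buf, BLKSZ, BLKSZ))
                err(1, "posix_memalign");
        memset(buf, 0, BLKSZ);

        for (i = 0; i < NWRITES; i++) {
                off = (off_t)i * BLKSZ;
                if (pwrite(fd, buf, BLKSZ, off) != BLKSZ)
                        err(1, "pwrite");
        }

        /*
         * One fdatasync() covers the whole batch - effectively a
         * single REQ_PREFLUSH|REQ_FUA journal write for all the data
         * IOs above, rather than one per write as O_DSYNC would do.
         */
        if (fdatasync(fd) < 0)
                err(1, "fdatasync");

        free(buf);
        close(fd);
        return 0;
}

Nothing in the batch can be considered stable until that fdatasync()
returns, which is why the exported device then has to behave as one
with a volatile write cache and honour the flush requests it is sent.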