Very helpful, allow me to provide some concrete scenarios. SQL Server has data and log files and we honor write-ahead logging (WAL) to provide proper crash recovery and durability of the database.

 * http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
 * http://download.microsoft.com/download/4/f/8/4f8f2dc9-a9a7-4b68-98cb-163482c95e0b/SQLIOBasicsCh2.doc

---------------------------------------------------------------------------------------------
#1 Scenario - Write vs Flush and io_getevents Ordering
---------------------------------------------------------------------------------------------

If SQL Server opens data and log files with O_DIRECT | O_DSYNC:

1. T1 - SQL Server issues io_submit (IOCB_CMD_PWRITE) for a data file
   page. xfs_file_write_iter calls xfs_file_dio_aio_write (W1). As you
   pointed out, the return is -EIOCBQUEUED (from the block device direct
   I/O queue insert when !sync), so T1 can continue forward and skips
   the generic_write_sync in xfs_file_write_iter.

2. T2 - SQL Server waits on io_getevents for the I/O (W1) posted by T1
   via the same async kernel I/O context. When W1 is considered
   complete, SQL Server can truncate/reuse our database log file.

I need to validate that the kernel only returns W1 from io_getevents after W1 has been processed by the hardware and the associated flush (F1) has been issued and completed; then SQL Server can safely truncate the database transaction log.

The code shows iomap_dio_complete_work, which I expected to issue the REQ_PREFLUSH for the O_DSYNC-based write and prevent the completion from reaching the io_getevents caller. The patch moved this to iomap_dio_complete, which I believe meets my needs.

I think the answer is that the iomap* logic makes sure to issue generic_write_sync for an O_DSYNC W1 after W1 is completed by the hardware, and then waits for completion of the flush request (REQ_PREFLUSH) before W1 is returned to the AIO completion ring, preventing io_getevents from processing W1 before the flush occurs and completes. I just need proper confirmation from the experts on this code that this is the expected behavior.

---------------------------------------------------------------------------------------------
#2 Dynamic determination of performant FUA capabilities
---------------------------------------------------------------------------------------------

For SQL Server, using O_DIRECT | O_DSYNC on current kernels is very performance impacting. Instead we enable a mode for SQL Server that opens with O_DIRECT only and issues fsync/fdatasync when we are hardening log files or checkpointing data files. This reduces the write, flush, write, flush pattern to write, write, write, ... then flush, as we only issue flush requests when required to maintain the data integrity boundaries of SQL Server. As you can imagine, the performance is significantly better than a device flush for each write. Testing shows the FUA enhancement is better than the write, flush pattern.

For SQL Server we want to dynamically open with O_DIRECT | O_DSYNC when REQ_FUA can be properly used, and open with O_DIRECT only, leveraging SQL Server's alternate flush scheme, when running on an older kernel or a system that does not support FUA (SATA, IDE, ...). A sketch of both patterns follows below.
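For concreteness, here is a minimal userspace sketch of the two I/O modes described above. This is illustrative only, not SQL Server code; the file name, sizes, and the use_dsync switch are made up for the example:

/*
 * Minimal sketch of the two I/O modes discussed above. Illustrative only.
 * Build with: gcc -D_GNU_SOURCE sketch.c -o sketch -laio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096  /* O_DIRECT requires aligned buffers, sizes and offsets */

int main(int argc, char **argv)
{
	int use_dsync = (argc > 1);  /* mode (a) O_DSYNC if any argument given */
	int flags = O_WRONLY | O_CREAT | O_DIRECT | (use_dsync ? O_DSYNC : 0);
	int fd = open("example.log", flags, 0600);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, BLK, BLK)) return 1;
	memset(buf, 0x5a, BLK);

	io_context_t ctx = 0;
	if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

	/*
	 * T1: queue the write (W1). io_submit() returns once the request is
	 * queued (-EIOCBQUEUED inside the kernel); it does not wait for the
	 * device flush.
	 */
	struct iocb cb, *list[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, BLK, 0);
	if (io_submit(ctx, 1, list) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

	/*
	 * T2: reap W1 (ev.res holds bytes written or -errno). In mode (a) the
	 * question above is whether this completion is only delivered after
	 * the O_DSYNC flush/FUA has finished on the hardware.
	 */
	struct io_event ev;
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) return 1;

	/*
	 * Mode (b): no O_DSYNC, so durability is requested explicitly at the
	 * hardening boundary - write, write, write, ... then one flush.
	 */
	if (!use_dsync && fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

	io_destroy(ctx);
	close(fd);
	return 0;
}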
-----Original Message-----
From: Dave Chinner <david@xxxxxxxxxxxxx>
Sent: Tuesday, March 13, 2018 12:11 AM
To: Robert Dorr <rdorr@xxxxxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>; linux-xfs@xxxxxxxxxxxxxxx; Christoph Hellwig <hch@xxxxxx>; linux-fsdevel <linux-fsdevel@xxxxxxxxxxxxxxx>; Jan Kara <jack@xxxxxxx>; Theodore Ts'o <tytso@xxxxxxx>; Matthew Wilcox <mawilcox@xxxxxxxxxxxxx>; Scott Konersmann <scottkon@xxxxxxxxxxxxx>; Slava Oks <slavao@xxxxxxxxxxxxx>; Jasraj Dange <jasrajd@xxxxxxxxxxxxx>; Michael Nelson <micn@xxxxxxxxxxxxx>
Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

On Tue, Mar 13, 2018 at 12:15:28AM +0000, Robert Dorr wrote:
> Hello all. I have a couple of follow-up questions around this effort;
> thank you all for all your kind inputs, patience and knowledge
> transfer.
>
> 1. How does xfs or ext4 make sure a pattern of WS followed by
> FWS does not allow the write (WS) completion to be visible before the
> flush completes?

I'm not sure what you are asking here. You need to be more precise about what these IOs are, who dispatched them and what their dependencies are. Where exactly did that FWS (REQ_FLUSH) come from?

I think, though, you're asking questions about IO ordering at the wrong level - filesystems serialise and order IO, not the block layer. Hence what you see at the block layer is not necessarily a reflection of the ordering the filesystem is doing. (I've already explained this earlier today in a different thread: https://marc.info/?l=linux-xfs&m=152091489100831&w=2)

That's why I ask about the operation causing a REQ_FLUSH to be issued to the storage device, as that cannot be directly issued from userspace. It will occur as a side effect of a data integrity operation the filesystem is asked to perform, but without knowing the relationship between the integrity operation and the write in question an answer cannot be given.

It would be better to describe your IO ordering and integrity requirements at a higher level (e.g. the syscall layer), because then we know what you are trying to achieve rather than trying to understand your problem from context-less questions about "IO barriers" that don't actually exist...

> I suspected the write was held in
> iomap_dio_complete_work but with the generic_write_sync change in the
> patch would an O_DSYNC write request to a DpoFua=0 block queue allow T2
> to see the completion via io_getevents before T1 completed the actual
> flush?

Yes, that can happen as concurrent data direct IOs are not serialised against each other and will always race to completion without providing any ordering guarantees. IOWs, if you have an IO ordering dependency in your application, then that ordering dependency needs to be handled in the application.
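As a concrete illustration of handling that dependency at the application level, here is a hedged sketch (harden_and_truncate() is a hypothetical helper name, not an API from the patch set or from SQL Server): reap every in-flight log write, then issue the flush, and only then reuse the log.

/*
 * Sketch: enforcing the W1 -> F1 -> truncate ordering dependency in the
 * application, per the answer above. Assumes log_fd was opened with
 * O_DIRECT only (no O_DSYNC).
 */
#include <libaio.h>
#include <unistd.h>

int harden_and_truncate(io_context_t ctx, int log_fd, int nr_inflight,
			off_t keep_bytes)
{
	struct io_event evs[128];
	int done = 0;

	/* Step 1: wait until every in-flight log write (W1..Wn) completes;
	 * each evs[i].res should also be checked for short/failed writes. */
	while (done < nr_inflight) {
		int n = io_getevents(ctx, 1, 128, evs, NULL);
		if (n < 0)
			return n;
		done += n;
	}

	/* Step 2: F1 - make the completed writes durable. Only after this
	 * returns is it safe to reuse the log space. */
	if (fdatasync(log_fd) < 0)
		return -1;

	/* Step 3: the dependent operation - truncate/reuse the log. */
	return ftruncate(log_fd, keep_bytes);
}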
> 2. How will my application be able to dynamically determine if
> xfs and ext4 have the performance enhancement for FUA or if I need to
> engage alternate methods to use fsync/fdatasync at strategic locations?

You don't. The filesystem will provide the same integrity guarantees in either case - FUA is just a performance optimisation that will get used if your hardware supports it. Applications should not care what capabilities the storage hardware has - the kernel should do what is fastest and most reliable for the underlying storage....

> 3. Are there any plans yet to optimize ext4 as well?

Not from me.

> 4. Before the patched code the xfs_file_write_iter would call
> generic_write_sync and that calls submit_io_wait. Does this hold the
> thread issuing the io_submit so it is unable to drive more async I/O?

No, -EIOCBQUEUED is returned to avoid blocking. AIO calls generic_write_sync() from the IO completion path via a worker thread so it's all done asynchronously.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
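One way to observe the non-blocking submit behaviour described in the answer to question 4 is to time io_submit() separately from io_getevents() for an O_DSYNC direct write: the submit should return in microseconds while the device write plus flush/FUA cost shows up on the reap side. The following is an illustrative test sketch (file name and build line are assumptions, not from the thread):

/*
 * Sketch: io_submit() does not block on the O_DSYNC sync work; that
 * happens on the completion side. Build with:
 * gcc -D_GNU_SOURCE timing.c -o timing -laio
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double since_us(struct timespec *t0)
{
	struct timespec t1;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0->tv_sec) * 1e6 +
	       (t1.tv_nsec - t0->tv_nsec) / 1e3;
}

int main(void)
{
	int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, 4096)) return 1;
	memset(buf, 0, 4096);

	io_context_t ctx = 0;
	if (io_setup(8, &ctx) < 0) return 1;

	struct iocb cb, *list[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, 4096, 0);

	struct timespec t0;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (io_submit(ctx, 1, list) != 1) return 1;
	printf("io_submit:    %8.1f us\n", since_us(&t0)); /* small: queue only */

	struct io_event ev;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) return 1;
	printf("io_getevents: %8.1f us\n", since_us(&t0)); /* write + flush/FUA */

	io_destroy(ctx);
	close(fd);
	return 0;
}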