Hi folks,

This is the XFS implementation of the sub-block DIO optimisations for written extents that I've mentioned on #xfs and a couple of times now on the XFS mailing list.

It takes the approach of using the IOMAP_NOWAIT non-blocking IO submission infrastructure to optimistically dispatch sub-block DIO without exclusive locking. If the extent mapping callback decides that it can't do the unaligned IO without extent manipulation, sub-block zeroing, blocking or splitting the IO into multiple parts, it aborts the IO with -EAGAIN. This allows the high level filesystem code to then take exclusive locks and resubmit the IO once it has guaranteed no other IO is in progress on the inode (i.e. what the current implementation does for every sub-block DIO).

This requires moving the IOMAP_NOWAIT setup decisions up into the filesystem, adding yet another parameter to iomap_dio_rw(). So first I convert iomap_dio_rw() to take an args structure so that we don't have to modify the API every time we want to add another setup parameter to the DIO submission code.

I then include Christoph's IOCB_NOWAIT fixes and cleanups to the XFS code, because they needed to be done regardless of the unaligned DIO issues and they make the changes simpler.

Then I split the unaligned DIO path out from the aligned path, because all the extra complexity to support better unaligned DIO submission concurrency is not necessary for the block aligned path.

Finally, I modify the unaligned IO path to first submit the unaligned IO using non-blocking semantics and provide a fallback to run the IO exclusively if that fails.

This means that we consider sub-block DIO into written extents a fast path that should almost always succeed with minimal overhead, and we put all the overhead of failure into the slow path where exclusive locking is required. Unlike Christoph's proposed patch, this means we don't require an extra ILOCK cycle in the sub-block DIO setup fast path, so it should perform almost identically to the block aligned fast path.
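
For anyone who wants the control flow at a glance, here is a rough userspace sketch of the "try non-blocking under a shared lock, fall back to exclusive on -EAGAIN" idea described above. It is not the patch code: every sketch_* name, the SKETCH_NOWAIT flag (standing in for IOMAP_NOWAIT), the args structure layout and the lock counters are made up for illustration, and the mapping decision is reduced to "is the extent under the IO already written?".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real kernel definitions. */
#define SKETCH_BLOCK_SIZE 4096u
#define SKETCH_NOWAIT     0x1u	/* plays the role of IOMAP_NOWAIT */

enum sketch_extent { SKETCH_WRITTEN, SKETCH_UNWRITTEN, SKETCH_HOLE };

/* One args structure, so new setup parameters don't change the call API. */
struct sketch_dio_args {
	unsigned long long	pos;	/* file offset of the IO */
	size_t			len;	/* length of the IO */
	enum sketch_extent	extent;	/* state of the extent under the IO */
	unsigned int		flags;	/* SKETCH_NOWAIT or 0 */
};

/* Counts how often each locking mode was used (stand-in for real locks). */
struct sketch_stats {
	int	shared;
	int	exclusive;
};

static bool sketch_unaligned(const struct sketch_dio_args *args)
{
	return (args->pos % SKETCH_BLOCK_SIZE) || (args->len % SKETCH_BLOCK_SIZE);
}

/*
 * Mapping callback: sub-block IO into anything other than a written extent
 * would need zeroing or extent manipulation, so under NOWAIT it bails out.
 */
static int sketch_map(const struct sketch_dio_args *args)
{
	bool needs_work = sketch_unaligned(args) &&
			  args->extent != SKETCH_WRITTEN;

	if (needs_work && (args->flags & SKETCH_NOWAIT))
		return -EAGAIN;
	return 0;	/* pretend the IO was submitted and completed */
}

static int sketch_dio_rw(const struct sketch_dio_args *args)
{
	return sketch_map(args);
}

/* Unaligned write path: optimistic fast path, exclusive slow path. */
static int sketch_write_unaligned(struct sketch_dio_args *args,
				  struct sketch_stats *stats)
{
	int ret;

	/* Fast path: shared lock, non-blocking submission. */
	args->flags |= SKETCH_NOWAIT;
	stats->shared++;
	ret = sketch_dio_rw(args);
	if (ret != -EAGAIN)
		return ret;

	/* Slow path: exclusive lock, wait for other IO, then resubmit. */
	args->flags &= ~SKETCH_NOWAIT;
	stats->exclusive++;
	ret = sketch_dio_rw(args);
	assert(ret != -EAGAIN);	/* blocking submission must not bounce */
	return ret;
}
```

Calling sketch_write_unaligned() for a 512 byte write at offset 512 into a written extent only ever bumps the shared counter, while the same write into an unwritten extent takes the shared attempt and then the exclusive retry. The sketch deliberately ignores the case where the caller itself asked for non-blocking IO, where the -EAGAIN would presumably have to go back to the caller rather than trigger the fallback.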

Tested using fio with AIO+DIO randrw to a written file. Performance increases from about 20k IOPS to 150k IOPS, which is the limit of the setup I was using for testing. Also passed the fstests auto group on both v4 and v5 XFS filesystems.

Thoughts, comments?

-Dave.