On Fri, May 20, 2022 at 01:27:39PM +1000, Dave Chinner wrote:
> > > * stx_offset_align_optimal: the alignment (in bytes) suggested for file
> > >   offsets and I/O segment lengths to get optimal performance. This
> > >   applies to both DIO and buffered I/O. It differs from stx_blocksize
> > >   in that stx_offset_align_optimal will contain the real optimum I/O
> > >   size, which may be a large value. In contrast, for compatibility
> > >   reasons stx_blocksize is the minimum size needed to avoid page cache
> > >   read/write/modify cycles, which may be much smaller than the optimum
> > >   I/O size. For more details about the motivation for this field, see
> > >   https://lore.kernel.org/r/20220210040304.GM59729@xxxxxxxxxxxxxxxxxxx
> >
> > Hmm. So I guess this is supposed to be the filesystem's best guess at
> > the IO size that will minimize RMW cycles in the entire stack? i.e. if
> > the user does not want RMW of pagecache pages, of file allocation units
> > (if COW is enabled), of RAID stripes, or in the storage itself, then it
> > should ensure that all IOs are aligned to this value?
> >
> > I guess that means for XFS it's effectively max(pagesize, i_blocksize,
> > bdev io_opt, sb_width, and (pretend XFS can reflink the realtime volume)
> > the rt extent size)? I didn't see a manpage update for statx(2) but
> > that's mostly what I'm interested in. :)
>
> Yup, xfs_stat_blksize() should give a good idea of what we should
> do. It will end up being pretty much that, except without the need
> to a mount option to turn on the sunit/swidth return, and always
> taking into consideration extent size hints rather than just doing
> that for RT inodes...

While working on the man-pages update, I'm having second thoughts about the
stx_offset_align_optimal field. Does any filesystem other than XFS actually
want stx_offset_align_optimal, when st[x]_blksize already exists?
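For illustration only, here is a minimal sketch (not code from this thread) of how an application might already consume st_blksize through plain POSIX fstat(): it treats st_blksize as the preferred I/O size and rounds buffer lengths up to a multiple of it. The helper names preferred_io_size() and round_up_to_unit() are hypothetical, and the caller-supplied fallback for when fstat() fails is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the I/O size to use for fd, taken from
 * st_blksize ("preferred blocksize for efficient I/O").  Returns
 * `fallback` if fstat() fails or st_blksize is not positive.
 */
static size_t preferred_io_size(int fd, size_t fallback)
{
	struct stat st;

	if (fstat(fd, &st) != 0 || st.st_blksize <= 0)
		return fallback;
	return (size_t)st.st_blksize;
}

/*
 * Hypothetical helper: round len up to a whole number of unit-sized
 * chunks.  Assumes unit > 0 and that len + unit does not overflow.
 */
static size_t round_up_to_unit(size_t len, size_t unit)
{
	assert(unit > 0);
	return ((len + unit - 1) / unit) * unit;
}
```

A caller would size a read buffer as round_up_to_unit(len, preferred_io_size(fd, 4096)). Code written this way already picks up a large st_blksize, such as the multi-megabyte values some network filesystems report, without consulting any additional statx field.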
Many network filesystems, as well as tmpfs when hugepages are enabled, already
report large (multi-megabyte) values in st[x]_blksize. And all the
documentation I looked at (man pages for Linux, POSIX, FreeBSD, NetBSD, macOS)
describes st_blksize as something like "the preferred blocksize for efficient
I/O". It's never documented as being limited to PAGE_SIZE, which makes sense
because it's not.

So stx_offset_align_optimal seems redundant, and it is going to confuse
application developers, who will have to decide when to use st[x]_blksize and
when to use stx_offset_align_optimal.

Also, applications that don't work well with huge reported optimal I/O sizes
would continue to exist, since it will remain possible for applications to be
tested only on filesystems that report a small optimal I/O size.

Perhaps for now we should just add STATX_DIOALIGN instead of STATX_IOALIGN,
leaving out the stx_offset_align_optimal field? What do people think?

- Eric