On Wed, Jan 15, 2014 at 11:59:45AM -0600, Eric Sandeen wrote: > Some time ago, mkfs.xfs started picking the storage physical > sector size as the default filesystem "sector size" in order > to avoid RMW costs incurred by doing IOs at logical sector > size alignments. > > However, this means that for a filesystem made with i.e. > a 4k sector size on an "advanced format" 4k/512 disk, > 512-byte direct IOs are no longer allowed. This means > that XFS has essentially turned this AF drive into a hard > 4K device, from the filesystem on up. > > XFS's mkfs-specified "sector size" is really just controlling > the minimum size & alignment of filesystem metadata IO. > > There is no real need to tightly couple XFS's minimal > metadata size to the minimum allowed direct IO size; > XFS can continue doing metadata in optimal sizes, but > still allow smaller DIOs for apps which issue them, > for whatever reason. Given that we already serialise sub-block IO completely, doing "sub-sector" IO will also be serialised so there shouldn't be any new issues introduced by this change. > This patch adds 2 new fields to the xfs_buftarg, so that > we now track 3 sizes: > > 1) The device logical sector size > 2) The device physical sector size > 3) The filesystem sector size, which is the minimum unit and > alignment of IO which will be performed by metadata operations. I wouldn't call it the "filesystem sector size" because it's clear that it doesn't apply to everything in the filesystem. I'd prefer that we call it the "metadata sector size", similar to the "log sector size" we keep for situations where the external log device has a different sector size to the data device. > The first is used for the minimum allowed direct IO alignment, > the 2nd is used to report DIO sizes in XFS_IOC_DIOINFO > (the theory being, if an app actually asks, we can give them > the optimal answer, even if we allow smaller IOs), and the > 3rd is used internally by the filesystem for metadata IOs. Terminology nit: the 3rd is set by mkfs to define the physical format constraints that metadata in the filesystem must obey. If we ever have to allocate sector sized metadata, then it will be used for more than just IO.... > This has passed xfstests on filesystems made with 4k sectors, > including when run under the patch I sent to ignore > XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly > tested end of block behavior on preallocated, sparse, and > existing files when we do a 512 IO into a 4k file on a > 4k-sector filesystem, to be sure there were no unexpected > behaviors. > > Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx> > --- > > NB: This depends on this patch which is in the xfs tree, > but not yet upstream: > xfs: simplify xfs_setsize_buftarg callchain; remove unused arg > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 9fccfb5..a89dcdf 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -1599,6 +1599,7 @@ xfs_setsize_buftarg( > unsigned int blocksize, > unsigned int sectorsize) > { > + /* Set up filesystem block and sector sizes */ > btp->bt_bsize = blocksize; > btp->bt_sshift = ffs(sectorsize) - 1; The two places that use bt_sshift convert it back to a byte count. We should change this simply to being a byte count. > btp->bt_smask = sectorsize - 1; > @@ -1614,6 +1615,9 @@ xfs_setsize_buftarg( > return EINVAL; > } > > + /* Set up device logical & physical sector size info */ > + btp->bt_lsmask = bdev_logical_block_size(btp->bt_bdev) - 1; > + btp->bt_pssize = bdev_physical_block_size(btp->bt_bdev); > return 0; > } > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h > index 1cf21a4..29a0db9 100644 > --- a/fs/xfs/xfs_buf.h > +++ b/fs/xfs/xfs_buf.h > @@ -88,14 +88,28 @@ typedef unsigned int xfs_buf_flags_t; > */ > #define XFS_BSTATE_DISPOSE (1 << 0) /* buffer being discarded */ > > +/* > + * The xfs_buftarg contains 3 notions of "sector size" - > + * > + * 1) The device logical sector size > + * 2) The device physical sector size > + * 3) The filesystem sector size, which is the minimum unit and > + * alignment of IO which will be performed by metadata operations. > + * > + * The latter is specified at mkfs time, stored on-disk in the > + * superblock's sb_sectsize, and is set from there. > + */ > + > typedef struct xfs_buftarg { > dev_t bt_dev; > struct block_device *bt_bdev; > struct backing_dev_info *bt_bdi; > struct xfs_mount *bt_mount; > - unsigned int bt_bsize; > - unsigned int bt_sshift; > - size_t bt_smask; > + unsigned int bt_bsize; /* fs block size */ > + unsigned int bt_sshift; /* fs sector size shift */ > + size_t bt_smask; /* fs sector size mask */ > + size_t bt_lsmask; /* dev logical sectsz mask */ > + unsigned int bt_pssize; /* dev physical sector size */ This patch makes bt_smask unused, so it can be removed. bt_bsize is also unused, so that should be removed, too. indeed, the buftarg has a backpointer to the xfs_mount, so we can get the block size from there if it is ever necessary. And I think we should make these names a little more descriptive while we are touching them so comments to describe them are unnecessary: unsigned int bt_meta_sectorsize; unsigned int bt_physical_sectorsize; size_t bt_logical_sectormask; > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > index 33ad9a7..1f3431f 100644 > --- a/fs/xfs/xfs_ioctl.c > +++ b/fs/xfs/xfs_ioctl.c > @@ -1587,7 +1587,12 @@ xfs_file_ioctl( > XFS_IS_REALTIME_INODE(ip) ? > mp->m_rtdev_targp : mp->m_ddev_targp; > > - da.d_mem = da.d_miniosz = 1 << target->bt_sshift; > + /* > + * Report device physical sector size as "optimal" minimum, > + * unless blocksize is smaller than that. > + */ > + da.d_miniosz = min(target->bt_pssize, target->bt_bsize); Just grab the filesysetm block size from the xfs_mount: da.d_miniosz = min(target->bt_pssize, mp->m_sb.sb_blocksize); > + da.d_mem = da.d_miniosz; I'd suggest that this should be PAGE_SIZE - it's for memory buffer alignment, not IO alignment, so using the IO alignment just seems wrong to me... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs