On Fri, Mar 28, 2014 at 09:16:02AM -0400, Brian Foster wrote:
> Speculative preallocation and the associated throttling metrics
> assume we're working with large files on large filesystems. Users have
> reported inefficiencies in these mechanisms when we happen to be dealing
> with large files on smaller filesystems. This can occur because while
> prealloc throttling is aggressive under low free space conditions, it is
> not active until we reach 5% free space or less.
>
> For example, a 40GB filesystem has enough space for several files large
> enough to have multi-GB preallocations at any given time. If those files
> are slow growing, they might reserve preallocation for long periods of
> time as well as avoid the background scanner due to frequent
> modification. If a new file is written under these conditions, said file
> has no access to this already reserved space and premature ENOSPC is
> imminent.
>
> To handle this scenario, modify the buffered write ENOSPC handling and
> retry sequence to invoke an eofblocks scan. The eofblocks scan is
> attempted prior to the inode flush as it is lighter weight and the
> latter is a last resort to free space. In the smaller filesystem
> scenario, the eofblocks scan resets the usage of preallocation such that
> when the 5% free space threshold is met, throttling effectively takes
> over to provide fair and efficient preallocation until legitimate
> ENOSPC.
>
> The eofblocks scan is selective based on the nature of the failure. For
> example, an EDQUOT failure in a particular quota will use a filtered
> scan for that quota. Because we don't know which quota might have caused
> an allocation failure at any given time, we run a scan against each
> applicable quota determined to be under low free space conditions.
>
> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_dquot.h  | 15 ++++++++++++++
>  fs/xfs/xfs_file.c   | 32 ++++++++++++++++++++++++++---
>  fs/xfs/xfs_icache.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_icache.h |  1 +
>  4 files changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
> index d22ed00..899b99f 100644
> --- a/fs/xfs/xfs_dquot.h
> +++ b/fs/xfs/xfs_dquot.h
> @@ -141,6 +141,21 @@ static inline xfs_dquot_t *xfs_inode_dquot(struct xfs_inode *ip, int type)
>          }
>  }
>
> +/*
> + * Check whether a dquot is under low free space conditions. We assume the quota
> + * is enabled and enforced.
> + */
> +static inline bool xfs_dquot_lowsp(struct xfs_dquot *dqp)
> +{
> +        int64_t freesp;
> +
> +        freesp = be64_to_cpu(dqp->q_core.d_blk_hardlimit) - dqp->q_res_bcount;
> +        if (freesp < dqp->q_low_space[XFS_QLOWSP_1_PCNT])
> +                return true;
> +
> +        return false;
> +}
> +
>  #define XFS_DQ_IS_LOCKED(dqp)   (mutex_is_locked(&((dqp)->q_qlock)))
>  #define XFS_DQ_IS_DIRTY(dqp)    ((dqp)->dq_flags & XFS_DQ_DIRTY)
>  #define XFS_QM_ISUDQ(dqp)       ((dqp)->dq_flags & XFS_DQ_USER)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2e7989e..1ca16ce 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -38,6 +38,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_log.h"
>  #include "xfs_dinode.h"
> +#include "xfs_icache.h"
>
>  #include <linux/aio.h>
>  #include <linux/dcache.h>
> @@ -723,6 +724,7 @@ xfs_file_buffered_aio_write(
>          struct xfs_inode        *ip = XFS_I(inode);
>          ssize_t                 ret;
>          int                     enospc = 0;
> +        int                     scanned = 0;
>          int                     iolock = XFS_IOLOCK_EXCL;
>          size_t                  count = ocount;
>
> @@ -741,10 +743,34 @@ write_retry:
>                          pos, &iocb->ki_pos, count, 0);
>
>          /*
> -         * If we just got an ENOSPC, try to write back all dirty inodes to
> -         * convert delalloc space to free up some of the excess reserved
> -         * metadata space.
> +         * If we hit ENOSPC or a quota limit, use the selective nature of the
> +         * eofblocks scan to try and free up some lingering speculative
> +         * preallocation delalloc blocks.
> +         *
> +         * If we hit a quota limit, only scan for files covered by the quota. We
> +         * also consider ENOSPC here because project quota failure can return
> +         * ENOSPC instead of EDQUOT. The quota scanning only sets 'scanned' if
> +         * the inode is covered by a quota with low free space. This should
> +         * minimize interference with global ENOSPC handling.
> +         *
> +         * If a scan does not free enough space, resort to the inode flush big
> +         * hammer to convert delalloc space to free up some of the excess
> +         * reserved metadata space.
>           */
> +        if ((ret == -EDQUOT || ret == -ENOSPC) && !scanned) {
> +                scanned = xfs_inode_free_quota_eofblocks(ip);
> +                if (scanned)
> +                        goto write_retry;
> +        }
> +        if (ret == -ENOSPC && !scanned) {
> +                struct xfs_eofblocks eofb = {0,};

IIRC, you can just use "{ 0 }" for initialisation, no "," needed.

> +
> +                eofb.eof_scan_owner = ip->i_ino; /* for locking */
> +                eofb.eof_flags = XFS_EOF_FLAGS_SYNC | XFS_EOF_FLAGS_FLUSH;
> +                xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +                scanned = 1;
> +                goto write_retry;
> +        }
>          if (ret == -ENOSPC && !enospc) {
>                  enospc = 1;
>                  xfs_flush_inodes(ip->i_mount);

This seems overly complex and fragile. I'd much prefer that we don't
bury data writeback deep in the EOF block freeing code - we've done a
lot of work in the past to remove exactly that sort of behaviour from
XFS inode scanners.

I'd prefer to see something like this:

	if (ret == -EDQUOT && !enospc) {
		enospc = 1;
		xfs_inode_free_quota_eofblocks(ip);
		goto retry;
	} else if (ret == -ENOSPC && !enospc) {
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		....
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		goto retry;
	}

This way has specific behaviours for EDQUOT vs ENOSPC, and we treat
them appropriately with a minimum of differences. And ENOSPC is
global, because we can't tell the difference here between a project
quota ENOSPC and a global ENOSPC at this point.
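Fleshed out against the rest of this patch, that would look something
like the following. This is just an untested sketch to make the
structure concrete - it reuses the write_retry label from the patch,
and it drops XFS_EOF_FLAGS_FLUSH from the ENOSPC case because
xfs_flush_inodes() has already done the data writeback by the time we
run the eofblocks scan:

	if (ret == -EDQUOT && !enospc) {
		/*
		 * Quota limit hit: free lingering preallocation on the
		 * quotas covering this inode, then retry the write.
		 */
		enospc = 1;
		xfs_inode_free_quota_eofblocks(ip);
		goto write_retry;
	} else if (ret == -ENOSPC && !enospc) {
		struct xfs_eofblocks eofb = { 0 };

		/*
		 * Global ENOSPC: flush dirty data first, then free EOF
		 * blocks across the whole filesystem and retry.
		 */
		enospc = 1;
		xfs_flush_inodes(ip->i_mount);
		eofb.eof_scan_owner = ip->i_ino;	/* for locking */
		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		goto write_retry;
	}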
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index bd0ab7d..471ccfa 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -33,6 +33,9 @@
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
>  #include "xfs_bmap_util.h"
> +#include "xfs_quota.h"
> +#include "xfs_dquot_item.h"
> +#include "xfs_dquot.h"
>
>  #include <linux/kthread.h>
>  #include <linux/freezer.h>
> @@ -1277,6 +1280,62 @@ xfs_icache_free_eofblocks(
>                                          eofb, XFS_ICI_EOFBLOCKS_TAG);
>  }
>
> +/*
> + * Run eofblocks scans on the quotas applicable to the inode. For inodes with
> + * multiple quotas, we don't know exactly which quota caused an allocation
> + * failure. We make a best effort by running scans for each quota considered
> + * to be under low free space conditions (less than 1% available free space).
> + */
> +int
> +xfs_inode_free_quota_eofblocks(
> +        struct xfs_inode *ip)
> +{
> +        int scanned = 0;
> +        struct xfs_eofblocks eofb = {0,};
> +        struct xfs_dquot *dq;
> +
> +        ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +
> +        /* set the scan owner to avoid potential livelock */
> +        eofb.eof_scan_owner = ip->i_ino;
> +
> +        if (XFS_IS_UQUOTA_ENFORCED(ip->i_mount)) {
> +                dq = xfs_inode_dquot(ip, XFS_DQ_USER);
> +                if (dq && xfs_dquot_lowsp(dq)) {
> +                        eofb.eof_uid = VFS_I(ip)->i_uid;
> +                        eofb.eof_flags = XFS_EOF_FLAGS_SYNC|

Whitespace - missing space before the '|' here (and in the other two
cases below).

> +                                         XFS_EOF_FLAGS_UID;
> +                        xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +                        scanned = 1;
> +                }
> +        }
> +
> +        if (XFS_IS_GQUOTA_ENFORCED(ip->i_mount)) {
> +                dq = xfs_inode_dquot(ip, XFS_DQ_GROUP);
> +                if (dq && xfs_dquot_lowsp(dq)) {
> +                        eofb.eof_gid = VFS_I(ip)->i_gid;
> +                        eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +                                         XFS_EOF_FLAGS_GID;
> +                        xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +                        scanned = 1;
> +                }
> +        }
> +
> +        if (XFS_IS_PQUOTA_ENFORCED(ip->i_mount)) {
> +                dq = xfs_inode_dquot(ip, XFS_DQ_PROJ);
> +                if (dq && xfs_dquot_lowsp(dq)) {
> +                        eofb.eof_prid = xfs_get_projid(ip);
> +                        eofb.eof_flags = XFS_EOF_FLAGS_SYNC|
> +                                         XFS_EOF_FLAGS_PRID|
> +                                         XFS_EOF_FLAGS_FLUSH;
> +                        xfs_icache_free_eofblocks(ip->i_mount, &eofb);
> +                        scanned = 1;
> +                }
> +        }

I really don't like the fact that project quota is hiding a data flush
in the "free_quota_eofblocks" logic. It just strikes me as the wrong
thing to do because if it's a real ENOSPC we're just going to have to
do this anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx