On Fri, Jan 26, 2018 at 08:06:25AM -0500, Brian Foster wrote: > On Thu, Jan 25, 2018 at 12:20:33PM -0800, Darrick J. Wong wrote: > > On Thu, Jan 25, 2018 at 12:31:12PM -0500, Brian Foster wrote: > > > On Tue, Jan 23, 2018 at 06:18:35PM -0800, Darrick J. Wong wrote: > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > > > > > In xfs_bmap_btalloc, we try using the CoW extent size hint to force > > > > allocations to align (offset-wise) to cowextsz granularity to reduce CoW > > > > fragmentation. This works fine until we cannot satisfy the allocation > > > > with enough blocks to cover the requested range and the alignment hints. > > > > If this happens, return an unaligned region because if we don't the > > > > extent trim functions cause us to return a zero-length extent to iomap, > > > > which iomap doesn't catch and thus blows up. > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > > > --- > > > > > > Hmm.. is this a direct I/O thing? The description of the problem had me > > > > Yes. > > > > > wondering how we handle this with regard to dio and traditional extent > > > size hints. It looks like we just return -ENOSPC if xfs_bmapi_write() > > > doesn't return a mapping that covers the target range of the write (even > > > if it apparently attempts to allocate part of the associated extent size > > > hint range). E.g., see the nimaps == 0 check in > > > xfs_iomap_write_direct() after we commit the transaction. > > > > I did take a look at that, and didn't like it. > > > > There's enough free space to fill the dio write, but the free space > > itself is very fragmented so we can't honor the hint. We did however > > manage to allocate /some/ blocks, so we might as well return what we got > > and let the next iteration of the iomap_apply loop try to fill the rest > > of the write request. We already reserved enough space, so the write > > should succeed totally, not return to userspace with either a short > > write or ENOSPC just because free space is fragmented. > > > > I'm not following how a short write can necessarily be prevented, since > space reservation doesn't guarantee contiguity and afaict we only make a > single mapping call. I suppose the iomap level can loop, but that's > outside of the context where blocks are reserved. Hm? > > But regardless, this behavior seems reasonable to me if we apply it > consistently between cow fork hint behavior and traditional extent size > hint behavior. They are both hints, after all. I do think an nimaps > check might still be appropriate in that reflink code path simply to > cover the case of unexpected behavior or a bug, rather than brace for > whatever is going to happen if we continue to shuffle a bogus imap > around. Yes, the new version makes the behavior consistent for both hints, and adds the "got no blocks, so sad" check to _reflink_allocate_cow. --D > Brian > > > (The other problem is that if we return ENOSPC out of iomap_begin, that > > error code will bubble all the way back to userspace even if we /did/ > > write something, which means that even the programs that handle short > > dio writes correctly will see that ENOSPC and bail out. Goldwyn has > > been trying to fix that braindamage for some time now.) > > > > > In fact, it looks like just repeating the failed write could eventually > > > succeed if the issue is that there is actually enough free space > > > available to allocate the hint range up to where the write is targeted, > > > just no long enough extent available to fill the extent size hint range > > > in a single bmapi_write call. That behavior is a bit strange, I admit, > > > but I'm wondering if we could do the same thing for the cow hint. Would > > > a similar nimaps check in the xfs_bmapi_write() caller resolve the bug > > > described here? > > > > > > If so and if we still care to actually change/fix the allocation > > > behavior with regard to the hints, perhaps we could do that in a > > > separate patch more generically for both hints..? > > > > I get the feeling we could apply this change to all the data fork > > bmap_btalloc calls too. I'll go study that in more depth. > > > > --D > > > > > > > > Brian > > > > > > > fs/iomap.c | 2 +- > > > > fs/xfs/libxfs/xfs_bmap.c | 21 +++++++++++++++++++-- > > > > 2 files changed, 20 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > > > index e5de772..aec35a0 100644 > > > > --- a/fs/iomap.c > > > > +++ b/fs/iomap.c > > > > @@ -63,7 +63,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags, > > > > ret = ops->iomap_begin(inode, pos, length, flags, &iomap); > > > > if (ret) > > > > return ret; > > > > - if (WARN_ON(iomap.offset > pos)) > > > > + if (WARN_ON(iomap.offset > pos) || WARN_ON(iomap.length == 0)) > > > > return -EIO; > > > > > > > > /* > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > > > > index 93ce2c6..4ec1fdc5 100644 > > > > --- a/fs/xfs/libxfs/xfs_bmap.c > > > > +++ b/fs/xfs/libxfs/xfs_bmap.c > > > > @@ -3480,8 +3480,20 @@ xfs_bmap_btalloc_filestreams( > > > > static void > > > > xfs_bmap_btalloc_cow( > > > > struct xfs_bmalloca *ap, > > > > - struct xfs_alloc_arg *args) > > > > + struct xfs_alloc_arg *args, > > > > + xfs_fileoff_t orig_offset, > > > > + xfs_extlen_t orig_length) > > > > { > > > > + /* > > > > + * If we didn't get enough blocks to satisfy the cowextsize > > > > + * aligned request, break the alignment and return whatever we > > > > + * got; it's the best we can do. > > > > + */ > > > > + if (ap->length <= orig_length) > > > > + ap->offset = orig_offset; > > > > + else if (ap->offset + ap->length < orig_offset + orig_length) > > > > + ap->offset = orig_offset + orig_length - ap->length; > > > > + > > > > /* Filling a previously reserved extent; nothing to do here. */ > > > > if (ap->wasdel) > > > > return; > > > > @@ -3520,6 +3532,8 @@ xfs_bmap_btalloc( > > > > xfs_agnumber_t fb_agno; /* ag number of ap->firstblock */ > > > > xfs_agnumber_t ag; > > > > xfs_alloc_arg_t args; > > > > + xfs_fileoff_t orig_offset; > > > > + xfs_extlen_t orig_length; > > > > xfs_extlen_t blen; > > > > xfs_extlen_t nextminlen = 0; > > > > int nullfb; /* true if ap->firstblock isn't set */ > > > > @@ -3529,6 +3543,8 @@ xfs_bmap_btalloc( > > > > int stripe_align; > > > > > > > > ASSERT(ap->length); > > > > + orig_offset = ap->offset; > > > > + orig_length = ap->length; > > > > > > > > mp = ap->ip->i_mount; > > > > > > > > @@ -3745,7 +3761,8 @@ xfs_bmap_btalloc( > > > > ASSERT(nullfb || fb_agno <= args.agno); > > > > ap->length = args.len; > > > > if (ap->flags & XFS_BMAPI_COWFORK) { > > > > - xfs_bmap_btalloc_cow(ap, &args); > > > > + xfs_bmap_btalloc_cow(ap, &args, orig_offset, > > > > + orig_length); > > > > } else { > > > > ap->ip->i_d.di_nblocks += args.len; > > > > xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE); > > > > > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html