On Fri, Mar 29, 2024 at 03:51:49PM -0700, Darrick J. Wong wrote: > On Thu, Mar 28, 2024 at 09:15:25AM +1100, Dave Chinner wrote: > > On Tue, Mar 26, 2024 at 07:07:58PM -0700, Darrick J. Wong wrote: > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> .... > > I think a commit message update is necessary. :) > > xfs: fix performance problems when fstrimming a subset of a fragmented AG > > On a 10TB filesystem where the free space in each AG is heavily > fragmented, I noticed some very high runtimes on a FITRIM call for the > entire filesystem. xfs_scrub likes to report progress information on > each phase of the scrub, which means that a strace for the entire > filesystem: > > ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839> > > shows that scrub is uncommunicative for the entire duration. Reducing > the size of the FITRIM requests to a single AG at a time produces lower > times for each individual call, but even this isn't quite acceptable, > because the time between progress reports are still very high: > > Strace for the first 4x 1TB AGs looks like (2): > ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033> > ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323> > ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226> > ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744> > > I then had the idea to limit the length parameter of each call to a > smallish amount (~11GB) so that we could report progress relatively > quickly, but much to my surprise, each FITRIM call still took ~68 > seconds! > > Unfortunately, the by-length fstrim implementation handles this poorly > because it walks the entire free space by length index (cntbt), which is > a very inefficient way to walk a subset of the blocks of an AG. > > Therefore, create a second implementation that will walk the bnobt and > perform the trims in block number order. This implementation avoids the > worst problems of the original code, though it lacks the desirable > attribute of freeing the biggest chunks first. > > On the other hand, this second implementation will be much easier to > constrain the system call latency, and makes it much easier to report > fstrim progress to anyone who's running xfs_scrub. Much better :) > > Here's a completely untested, uncompiled version of this by-bno > > search I just wrote up to demonstrate that if we pass ag-confined > > agbnos from the top level, we don't need to duplicate this gather > > code at all or do trim range constraining inside the gather > > functions... > > Mostly looks ok, but let's dig in-- > > > -Dave. > > > > diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c > > index 268bb734dc0a..0c949dfc097a 100644 > > --- a/fs/xfs/xfs_discard.c > > +++ b/fs/xfs/xfs_discard.c > > @@ -145,14 +145,18 @@ xfs_discard_extents( > > return error; > > } > > > > +struct xfs_trim_cur { > > + xfs_agblock_t start; > > + xfs_extlen_t count; > > + xfs_agblock_t end; > > + xfs_extlen_t minlen; > > + bool by_len; > > +}; > > > > static int > > xfs_trim_gather_extents( > > struct xfs_perag *pag, > > - xfs_daddr_t start, > > - xfs_daddr_t end, > > - xfs_daddr_t minlen, > > - struct xfs_alloc_rec_incore *tcur, > > + struct xfs_trim_cur *tcur, > > struct xfs_busy_extents *extents, > > uint64_t *blocks_trimmed) > > { > > @@ -179,21 +183,21 @@ xfs_trim_gather_extents( > > if (error) > > goto out_trans_cancel; > > > > - cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > > > - /* > > - * Look up the extent length requested in the AGF and start with it. > > - */ > > - if (tcur->ar_startblock == NULLAGBLOCK) > > - error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i); > > - else > > + if (tcur->by_len) { > > + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > > I'm confused, why are we searching the by-bno btree if "by_len" is set? Because I changed the logic half way through writing it and forgot to change the variable name.... > Granted the logic for setting by_len triggers only for FITRIMs of > subsets of an AG so this functions conrrectly... > > > + error = xfs_alloc_lookup_ge(cur, tcur->ar_startblock, > > + 0, &i); > > ...insofar as this skips a free extent that starts before and ends after > tcur->start. Right - did I mention this wasn't even compiled? :) > > > + } else { > > + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > > error = xfs_alloc_lookup_le(cur, tcur->ar_startblock, > > tcur->ar_blockcount, &i); > > + } > > How about this initialization logic: > > if (tcur->by_bno) { > /* sub-AG discard request always starts at tcur->start */ > cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i); OK. > } else if (tcur->start == NULLAGBLOCK) { > /* first time through a by-len starts with max length */ > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i); > } else { > /* nth time through a by-len starts where we left off */ > cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); > error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i); > } ... but that is what I was explicilty trying to avoid because the initial search on a by-count config is for tcur->count = pag->pagf_longest. IOWs, there is no "greater" sized free extent than tcur->count, every free extent *must* be less than or equal to pag->pagf_longest. The _ge() search is done because the start block of zero is less than every agbno in the tree. Hence we have to do a _ge() search to "find" the pag->pagf_longest extent based on start block criteria. If we pass in a tcur->start = NULLAGBLOCK (0xffffffff) or pag->pag_blocks then every agbno is "less than" the start block, and so it should return the extent at the right most edge of the tree with a _le() search regardless of where in the AG it is located. Hence the initial search can also be a _le() search and we don't have to special case it - the by-count tree has two components in it's search key and if we set the initial values of both components correctly one search works for all cases... > > @@ -248,10 +235,35 @@ xfs_trim_gather_extents( > > * supposed to discard skip it. Do not bother to trim > > * down partially overlapping ranges for now. > > */ > > - if (dbno + dlen < start || dbno > end) { > > + if (fbno + flen < tcur->start) { > > trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > goto next_extent; > > } > > + if (fbno > tcur->end) { > > + trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); > > + if (tcur->by_len) { > > + tcur->count = 0; > > + break; > > + } > > + goto next_extent; > > + } > > + > > + /* Trim the extent returned to the range we want. */ > > + if (fbno < tcur->start) { > > + flen -= tcur->start - fbno; > > + fbno = tcur->start; > > + } > > + if (fbno + flen > tcur->end + 1) > > + flen = tcur->end - fbno + 1; > > + > > + /* > > + * Too small? Give up. > > + */ > > + if (flen < minlen) { > > + trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); > > + tcur->count = 0; > > + break; > > + } > > For a by-bno search, this logic skips the entire rest of the AG after > the first free extent that's smaller than tcur->minlen. Instead, it > should goto next_extent, yes? Yes. > > @@ -415,12 +438,23 @@ xfs_ioc_trim( > > start = BTOBB(range.start); > > end = start + BTOBBT(range.len) - 1; > > > > - if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1) > > - end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1; > > Couldn't this simply be: > > end = min_t(xfs_daddr_t, start + BTOBBT(range.len) - 1, > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1); Yes. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx