> On May 17, 2023, at 6:01 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Thu, May 18, 2023 at 12:10:53AM +0000, Wengang Wang wrote: >> >> >>> On May 17, 2023, at 3:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>> >>> On Wed, May 17, 2023 at 07:14:32PM +0000, Wengang Wang wrote: >>>>> On May 16, 2023, at 6:40 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >>>>> On Tue, May 16, 2023 at 05:59:13PM -0700, Darrick J. Wong wrote: >>>>> I was thinking this code changes to: >>>>> >>>>> flags |= XFS_ALLOC_FLAG_TRY_FLUSH; >>>>> .... >>>>> <attempt allocation> >>>>> .... >>>>> if (busy) { >>>>> xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR); >>>>> trace_xfs_alloc_size_busy(args); >>>>> error = xfs_extent_busy_flush(args->tp, args->pag, >>>>> busy_gen, flags); >>>>> if (!error) { >>>>> flags &= ~XFS_ALLOC_FLAG_TRY_FLUSH; >>>> >>>> What’s the benefits to use XFS_ALLOC_FLAG_TRY_FLUSH? >>>> If no change happened to pagb_gen, we would get nothing good in the retry >>>> but waste cycles. Or I missed something? >>> >>> You missed something: the synchronous log force is always done. >>> >> >> It’s true that synchronous log force is done. >> >>> The log force is what allows busy extents to be resolved - busy >>> extents have to be committed to stable storage before they can be >>> removed from the busy extent tree. >> >> Yep. >> >>> >>> If online discards are not enabled, busy extents are resolved >>> directly in journal IO completion - the log force waits for this to >>> occur. In this case, pag->pagb_gen will have already incremented to >>> indicate progress has been made, and we should never wait in the >>> loop after the log force. The only time we do that is when the >>> current transaction holds busy extents itself, and hence if the >>> current tx holds busy extents we should not wait beyond the log >>> force.... >> >> So you are talking about the case of “pagb_gen will have already incremented”, >> In this case your next two lines: >> >> if (busy_gen != READ_ONCE(pag->pagb_gen)) >> return 0; >> >> would capture that and return immediately without waiting. So TRY_FLUSH is not >> helpful in this case. > > Except when pag->pagb_gen doesn't change. If it hasn't changed, then > we'd immediately return -EAGAIN without trying again. That is not > what the behaviour we want; we want the allocation to always try > again at least once after a flush has been run, because .... > >>> If online discards are enabled, then they'll be scheduled by journal >>> IO completion. i.e. waiting on the log force guarntees pending >>> discards have been scheduled and they'll start completing soon after >>> the log force returns. When they complete they'll start incrementing >>> pag->pagb_gen. This is the case the pag->pagb_gen wait loop exists >>> for - it waits for a discard to complete and resolve the busy extent >>> in it's IO compeltion routine. At which point the allocation attempt >>> can restart. >> >> In above case, you are talking about the case that pag->pagb_gen will be >> incremented soon without any blocking. >> In this case, in the allocator path, still the two lines: >> if (busy_gen != READ_ONCE(pag->pagb_gen)) >> return 0; >> >> The condition of "busy_gen != READ_ONCE(pag->pagb_gen)” can be true >> or a false according to the race of this process VS the queue workers (which >> start completing). In case it’s true, the code return immediately. Otherwise >> the code runs into the loop waiting above condition become true. When >> the queue workers incremented pag->pagb_gen, the allocator path jumps >> out the loop and go — that’s perfect. I don't see why TRY_FLUSH is needed. > > .... we sample pag->pagb_gen part way through the extent search and > we might sample it multiple times if we encounter multiple busy > extents. The way the cursor currently works is that it stores the > last busy gen returned. The pag->pagb_gen value might change between > the first busy extent we encounter and the last that we encounter in > the search. If this happens, it implies that some busy extents have > resolved while we have been searching. > > In that case, we can enter the flush with busy_gen = pag->pagb_gen, > but that doesn't mean no busy extents have been resolved since we > started the allocation search. Hence the flush itself may not > resolve any new busy extents, but we still may have other busy > extents that were resolved while we were scanning the tree. > Yep, there could be such situations though don’t know frequently that happens. > Yes, we could change the way we track the busy gen in the cursor, > but that likely has other subtle impacts. e.g. we now have to > consider the fact we may never get a wakeup because all busy extents > in flight have already been resolved before we go to sleep. Hm.. yep, this could be a future enhancement. > > It is simpler to always retry the allocation after a minimal flush > to capture busy extents that were resolved while we were scanning > the tree. This is a slow path - we don't care if we burn extra CPU > on a retry that makes no progress because this condition is only > likely to occur when we are trying to do allocation or freeing > extents very near ENOSPC. Agreed. > > The try-flush gives us a mechanism that always does a minimum flush > first before a retry. If that retry fails, then we block, fail or > retry based on whether progress has been made. But we always want > that initial retry before we block, fail or retry... > OK. >> So change the line of >> >> flags |= XFS_ALLOC_FLAG_TRY_FLUSH; >> >> to >> >> if (xfs_has_discard(mp)) >> flags |= XFS_ALLOC_FLAG_TRY_FLUSH; >> >> Sounds good? > > No. The allocator itself should know nothing about when discards are > enabled - the whole point of the busy extent infrastructure handling > discards is to abstract discard delays away from the allocator > itself. The allocator only cares if the extent is busy or not, it > doesn't care why the extent is busy. And, as per above, the > TRY_FLUSH triggered retry is needed regardless of discards because > busy extents can resolve while a search is in progress instead of > being resolved by the log force... OK, will be with TRY_FLUSH. thanks, wengang