On Wed, Mar 12, 2025 at 08:46:36AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 12, 2025 at 01:35:23AM -0700, Christoph Hellwig wrote:
> > On Wed, Mar 12, 2025 at 08:27:05AM +0000, John Garry wrote:
> > > On 12/03/2025 07:24, Christoph Hellwig wrote:
> > > > On Mon, Mar 10, 2025 at 06:39:39PM +0000, John Garry wrote:
> > > > > Refactor xfs_reflink_end_cow_extent() into separate parts which
> > > > > process the CoW range and commit the transaction.
> > > > >
> > > > > This refactoring will be used in future for when it is required to
> > > > > commit a range of extents as a single transaction, similar to how
> > > > > it was done pre-commit d6f215f359637.
> > > >
> > > > Darrick pointed out that if you do more than just a tiny number
> > > > of extents per transaction you run out of log reservation very
> > > > quickly here:
> > > >
> > > > https://lore.kernel.org/all/20240329162936.GI6390@frogsfrogsfrogs/
> > > >
> > > > How does your scheme deal with that?
> > > >
> > > The resblks calculation in xfs_reflink_end_atomic_cow() takes care of
> > > this, right? Or does the log reservation have a hard size limit,
> > > regardless of that calculation?
> >
> > The resblks calculated there are the reserved disk blocks

Used for btree block allocations that might be needed during the
processing of the transaction.

> > and have nothing to do with the log reservation, which comes from
> > the tr_write field passed in. There is some kind of upper limit to
> > it, obviously bounded by the log size, although I'm not sure if
> > we've formalized that somewhere. Dave might be the right person to
> > ask about that.
>
> The (very very rough) upper limit for how many intent items you can
> attach to a tr_write transaction is:
>
> per_extent_cost = (cui_size + rui_size + bui_size + efi_size + ili_size)
> max_blocks = tr_write::tr_logres / per_extent_cost
>
> (ili_size is the inode log item size)

That doesn't sound right. The number of intents we can log is not
dependent on the aggregated size of all intent types. We do not log
all those intent types in a single transaction, nor do we process more
than one type of intent in a given transaction. Also, we only log the
inode once per transaction, so that is not a per-extent overhead.

Realistically, the tr_write transaction is going to be at least 100kB,
because it has to be big enough to log full splits of multiple btrees
(e.g. the BMBT plus both free space btrees). Yeah, a small 4kB block
size filesystem spits out:

xfs_trans_resv_calc: dev 7:0 type 0 logres 193528 logcount 5 flags 0x4

That's about 190kB. Intents, however, are typically very small -
around 32 bytes in size, plus another 12 bytes for the log region
ophdr. This implies that we can fit thousands of individual intents in
a single tr_write log reservation on any given filesystem, and the
number of loop iterations we can run in a single transaction is
therefore largely dependent on how many intents are logged per
iteration.

Hence if we are walking a range of extents in the BMBT to unmap them,
we should only be generating two intents per loop iteration - a BUI
for the BMBT removal and a CUI for the shared refcount decrease. That
means we should be able to run at least a thousand iterations of that
loop per transaction without getting anywhere near the transaction
reservation limits.
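To make that arithmetic concrete, here's a quick back-of-the-envelope
calculation (illustrative C only, not XFS code) using the rough
figures above: the 193528 byte tr_write logres from the trace, ~32
byte intents plus a 12 byte ophdr, and two intents per unmap loop
iteration:

#include <stdio.h>

int main(void)
{
	/* tr_write logres from the trace above (4kB block size fs) */
	const unsigned int tr_logres = 193528;

	/* rough intent size plus the log region ophdr, as quoted above */
	const unsigned int intent_bytes = 32 + 12;

	/* BUI + CUI per BMBT unmap loop iteration */
	const unsigned int intents_per_loop = 2;

	unsigned int max_intents = tr_logres / intent_bytes;

	printf("intents per tr_write reservation: ~%u\n", max_intents);
	printf("unmap loop iterations per transaction: ~%u\n",
	       max_intents / intents_per_loop);
	return 0;
}

This prints roughly 4398 intents and 2199 loop iterations, consistent
with the "thousands of intents" and "at least a thousand iterations"
estimates above.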
*However!*

We have to relog every intent we haven't processed in the deferred
batch every so often to prevent the outstanding intents from pinning
the tail of the log. Hence the larger the number of intents in the
initial batch, the more work we have to do later on (and the more
overall log space and bandwidth they will consume) to relog them over
and over again until they pop to the head of the processing queue.

Hence there is no real performance advantage to creating massive
intent batches, because we end up doing more work later on to relog
those intents in order to prevent journal space deadlocks.

It also doesn't speed up processing, because we still process the
intent chains one at a time from start to completion before moving on
to the next high level intent chain that needs to be processed.

Further, after the first couple of intent chains have been processed,
the initial log space reservation will have run out, and we are then
asking for a new reservation on every transaction roll we do. i.e. we
are now doing a log space reservation on every transaction roll in the
processing chain instead of only doing it once per high level intent
chain.

Hence, from a log space accounting perspective (the hottest code path
in the journal), it is far more efficient to perform a single high
level transaction per extent unmap operation than it is to batch
intents into a single high level transaction.

My advice is this: we should never batch high level iterative
intent-based operations into a single transaction, because it's a
false optimisation. It might look like an efficiency improvement from
the high level, but it ends up hammering the hot, performance critical
paths in the transaction subsystem much, much harder, and so will end
up being slower than the single transaction per intent-based operation
algorithm when it matters most....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
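As a footnote to the relogging argument above, here is a toy model
(again illustrative C, not XFS code) of why big batches cost more
overall log bandwidth. It assumes, purely for illustration, that every
still-pending intent is relogged once for each chain that completes
ahead of it, so the total relog traffic grows quadratically with the
batch size:

#include <stdio.h>

int main(void)
{
	/* rough intent size plus ophdr, as in the figures above */
	const unsigned int intent_bytes = 32 + 12;
	unsigned int batch;

	for (batch = 1; batch <= 4096; batch *= 8) {
		/*
		 * Toy assumption: each pending intent is relogged once
		 * per chain completed ahead of it, so a batch of N
		 * intents incurs N*(N-1)/2 relogs in total.
		 */
		unsigned long long relogs =
			(unsigned long long)batch * (batch - 1) / 2;

		printf("batch %4u: ~%llu relogs, ~%llu bytes relogged\n",
		       batch, relogs, relogs * intent_bytes);
	}
	return 0;
}

Under that (crude) assumption, a batch of one intent relogs nothing,
while a batch of 4096 intents incurs around 8.4 million relogs and
several hundred MB of relog traffic - the quadratic-versus-constant
trade-off described above.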