Hi folks, One of the significant limitations of the log reservation code is that it uses physical tracking of the reservation space to account for both the space used in the journal as well as the reservations held in memory by the CIL and activei running transactions. Because this in-memory reservation tracking requires byte-level granularity, this means that the "LSN" that the grant head stores it's location in is split into 32 bits for the log cycle and 32 bits for the grant head offset into the log. Storing a byte count as the grant head offset into the log means that we can only index 4GB of space with the grant head. This is one of the primary limiting factors preventing us from increasing the physical log size beyond 2GB. Hence to increase the physical log size, we have to increase the space available for storing the grant head. Needing more physical space to store the grant head is an issue because we use lockless atomic accounting for the grant head to minimise the overhead of new incoming transaction reservations. These have unbound concurrency, and hence any lock in the reservation path will cause serious scalability issues. The lockless accounting fast path was the solution to these scalability problems that we had over a decade ago, and hence we know we cannot go back to a lock based solution. Therefore we are still largely limited to the storage space we can perform atomic operations on. We already use 64 bit compare/exchange operations, and there is not widespread hardware support for 128 bit atomic compare/exchange operations so increasing the grant head LSN to a structure > 64 bits in size is not really an option. Hence we have to look for a different solution - one that doesn't require us to increase the amount of storage space for the grant head. This is where we need to recognise that the grant head is actually tracking three things: 1. physical log space that is tracked by the AIL; 2. physical log space that the CIL will soon consume; and 3. potential log space that active transactions *may* consume. One of the tricks that the grant heads play is that the don't need to explicitly track the space consumed by the AIL (#1), because the consumed log space is simply "grant head - log tail", and so it doesn't not matter how the space that is consumed moves between the three separate accounting groups. Reservation space is automatically returned to the "available pool" by the AIL moving the log tail forwards. Hence the grant head only needs to account for the journal space that transactions consume as they complete, and never have to be updated to account for metadata writeback emptying the journal. This all works because xlog_space_left() is a calculation of the difference between two LSNs - the log tail and the grant head. When the grant head wraps the log tail, we've run out of log space and the journal reservations get throttled until the log tail is moved forward to "unwrap" the grant head and make space available again. But there's no reason why we have to track log space in this way to determine that we've run out of reservation space - all we need is for xlog_space_left() to be able to accurately calculate when we've run out of space. So let's break this down. Firstly, the AIL tracks all the items in the journal, and so at any given time it should know exactly where the on-disk head and tail of the journal are located. At the moment, we only know where the tail is (xfs_ail_min_lsn()), and we update the log tail (log->l_tail_lsn) whenever the AIL minimum LSN changes. The AIL will see the maximum committed LSN, but it does not track this. Instead, the log tracks this as log->l_last_sync_lsn and updates this directly in iclog IO completion when a iclog has callbacks attached. That is, log->l_last_sync_lsn is updated whenever journal IO completion is going to insert the latest committed log items into the AIL. If the AIL is empty, the log tail is assigned the value stored in l_last_sync_lsn as the log tail now points to the last written checkpoint in the journal. The simplest way I can describe how we track the log space is as follows: l_tail_lsn l_last_sync_lsn grant head lsn |-----------------------|+++++++++++++++++++++| | physical space | in memory space | | - - - - - - xlog_space_left() - - - - - - - | It is simple for the AIL to track the maximum LSN that has been inserted into the AIL. If we do this, we no longer need to track log->l_last_sync_lsn in the journal itself and we can always get the physical space tracked by the journal directly from the AIL. The AIL functions can calculate the "log tail space" dynamically when either the log tail or the max LSN seen changes, thereby removing all need for the log itself to track this state. Hence we now have: l_tail_lsn ail_head_lsn grant head lsn |-----------------------|+++++++++++++++++++++| | log->l_tail_space | in memory space | | - - - - - - xlog_space_left() - - - - - - - | And we've solved the problem of efficiently calculating the amount of physical space the log is consuming. All this leaves is now calculating how much space we are consuming in memory. Luckily for us, we've just added all the update hooks needed to do this. From the above diagram, two things are obvious: 1. when the tail moves, only log->l_tail_space reduces 2. when the ail_max_lsn_seen increases, log->l_tail_space increases and "in memory space" reduces by the same amount. IOWs, we now have a mechanism that can transfer the in-memory reservation space directly to the on-disk tail space accounting. At this point, we can change the grant head from tracking physical location to tracking a simple byte count: l_tail_lsn ail_head_lsn grant head bytes |-----------------------|+++++++++++++++++++++| | log->l_tail_space | grant space | | - - - - - - xlog_space_left() - - - - - - - | and xlog_space_left() simply changes to: space left = log->l_logsize - grant space - log->l_tail_space; All of the complex grant head cracking, combining and compare/exchange code gets replaced by simple atomic add/sub operations, and the grant heads can now track a full 64 bit bytes space. The fastpath reservation accounting is also much faster because it is much simpler. There's one little problem, though. The transaction reservation code has to set the LSN target for the AIL to push to ensure that the log tail keeps moving forward (xlog_grant_push_ail()), and the deferred intent logging code also tries to keep abreast of the amount of space available in the log via xlog_grant_push_threshold(). The AIL pushing problem is actually easy to solve - we don't need to push the AIL from the transaction reservation code as the AIL already tracks all the space used by the journal. All the transaction reservation code does is try to keep 25% of the journal physically free once the AIL has items in it. Of course there is the corner case where the AIL can be empty and the reservations fully depleted, in which case we have to ensure that we kick the AIL regardless of it's state when a transaction goes to sleep on waiting for reservation space. Hence before we start changing any of the grant head accounting, we remove all the AIL pushing hooks from the reservation code and let the AIL determine the target it needs to push to itself. We also allow the deferred intent logging code to determine if the AIL should be tail pushing similar to how it currently checks if we are running out of log space, so the intent relogging still works as it should. WIth these changes in place, there is no external code that is dependent on the grant heads tracking physical space, and hence we can then implement the change to pure in-memory reservation space tracking in the grant heads..... This all passes fstests for default and rmapbt enable configs. Performance tests also show good improvements where the transaction accounting is the bottleneck. This has been written and tested on top of the CIL scalability, inode unlink item and lockless buffer lookup patchesets, so if you want to test this you are probably best to start with all of them applied first. -Dave. --- Version 2 - reorder moving xfs_trans_bulk_commit() patch to start of series - fix failure to consider NULLCOMMITLSN push target in AIL - grant space release based on ctx->start_lsn fails to release the space used in the checkpoint that was just committed. Release needs to be based on the the ctx->commit_lsn which is the end of the region that the checkpoint consumes in the log. - rename ail_max_seen_lsn to ail_head_lsn, and convert it to tracking the commit lsn of the latest checkpoint. This effectively replaces log->l_last_sync_lsn. - move AIL lsn updates and grant space returns to before we process the logvec chain to insert the new items into the AIL. This is necessary to avoid a transient window where the head of the AIL moves forward, increasing log tail space, but we haven't yet reduced the grant reservation space and hence available log space drops by the size of the checkpoint for the duration of the AIL insertion process before returning to where it should be. - add memory barriers to the grant head return and xlog_space_left() functions to ensure that xlog_space_left() will always see the updated log tail space if it sees a grant head that has had the space returned to it. This prevents transients where the tail can lag the head by 2 cycles as the log head wraps. - lots of other minor stuff.... Original RFC: - https://lore.kernel.org/linux-xfs/20220708015558.1134330-1-david@xxxxxxxxxxxxx/