On Wed, May 19, 2021 at 10:13:17PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > I wrote up a description of how transactions, space reservations and > relogging work together in response to a question for background > material on the delayed logging design. Add this to the existing > document for ease of future reference. > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > .../xfs-delayed-logging-design.rst | 361 ++++++++++++++++-- > 1 file changed, 322 insertions(+), 39 deletions(-) > > diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst > index 464405d2801e..395c63ca5b27 100644 > --- a/Documentation/filesystems/xfs-delayed-logging-design.rst > +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst > @@ -1,29 +1,314 @@ > .. SPDX-License-Identifier: GPL-2.0 > > -========================== > -XFS Delayed Logging Design > -========================== > - > -Introduction to Re-logging in XFS > -================================= > - > -XFS logging is a combination of logical and physical logging. Some objects, > -such as inodes and dquots, are logged in logical format where the details > -logged are made up of the changes to in-core structures rather than on-disk > -structures. Other objects - typically buffers - have their physical changes > -logged. The reason for these differences is to reduce the amount of log space > -required for objects that are frequently logged. Some parts of inodes are more > -frequently logged than others, and inodes are typically more frequently logged > -than any other object (except maybe the superblock buffer) so keeping the > -amount of metadata logged low is of prime importance. > - > -The reason that this is such a concern is that XFS allows multiple separate > -modifications to a single object to be carried in the log at any given time. > -This allows the log to avoid needing to flush each change to disk before > -recording a new change to the object. XFS does this via a method called > -"re-logging". Conceptually, this is quite simple - all it requires is that any > -new change to the object is recorded with a *new copy* of all the existing > -changes in the new transaction that is written to the log. > +================== > +XFS Logging Design > +================== > + > +Preamble > +======== > + > +This document describes the design and algorithms that the XFS journalling > +subsystem is based on. This document describes the design and algorithms that > +the XFS journalling subsystem is based on so that readers may familiarize > +themselves with the general concepts of how transaction processing in XFS works. > + > +We begin with an overview of transactions in XFS, followed by describing how > +transaction reservations are structured and accounted, and then move into how we > +guarantee forwards progress for long running transactions with finite initial > +reservations bounds. At this point we need to explain how relogging works. With > +the basic concepts covered, the design of the delayed logging mechanism is > +documented. > + > + > +Introduction > +============ > + > +XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata > +are atomic and recoverable. For reasons of space and time efficiency, the > +logging mechanisms are varied and complex, combining intents, logical and > +physical logging mechanisms to provide the necessary recovery guarantees the > +filesystem requires. > + > +Some objects, such as inodes and dquots, are logged in logical format where the > +details logged are made up of the changes to in-core structures rather than > +on-disk structures. Other objects - typically buffers - have their physical > +changes logged. Long running atomic modifications have individual changes > +chained together by intents, ensuring that journal recovery can restart and > +finish an operation that was only partially done when the system stopped > +functioning. > + > +The reason for these differences is to keep the amount of log space and CPU time > +required to process objects being modified as small as possible and hence the > +logging overhead as low as possible. Some items are very frequently modified, > +and some parts of objects are more frequently modified than others, so keeping > +the overhead of metadata logging low is of prime importance. > + > +The method used to log an item or chain modifications together isn't > +particularly important in the scope of this document. It suffices to know that > +the method used for logging a particular object or chaining modifications > +together are different and are dependent on the object and/or modification being > +performed. The logging subsystem only cares that certain specific rules are > +followed to guarantee forwards progress and prevent deadlocks. > + > + > +Transactions in XFS > +=================== > + > +XFS has two types of high level transactions, defined by the type of log space > +reservation they take. These are known as "one shot" and "permanent" > +transactions. Permanent transaction reservations can take reservations that span > +commit boundaries, whilst "one shot" transactions are for a single atomic > +modification. > + > +The type and size of reservation must be matched to the modification taking > +place. This means that permanent transactions can be used for one-shot > +modifications, but one-shot reservations cannot be used for permanent > +transactions. > + > +In the code, a one-shot transaction pattern looks somewhat like this:: > + > + tp = xfs_trans_alloc(<reservation>) > + <lock items> > + <join item to transaction> > + <do modification> > + xfs_trans_commit(tp); > + > +As items are modified in the transaction, the dirty regions in those items are > +tracked via the transaction handle. Once the transaction is committed, all > +resources joined to it are released, along with the remaining unused reservation > +space that was taken at the transaction allocation time. > + > +In contrast, a permanent transaction is made up of multiple linked individual > +transactions, and the pattern looks like this:: > + > + tp = xfs_trans_alloc(<reservation>) > + xfs_ilock(ip, XFS_ILOCK_EXCL) > + > + loop { > + xfs_trans_ijoin(tp, 0); > + <do modification> > + xfs_trans_log_inode(tp, ip); > + xfs_trans_roll(&tp); > + } > + > + xfs_trans_commit(tp); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + > +While this might look similar to a one-shot transaction, there is an important > +difference: xfs_trans_roll() performs a specific operation that links two > +transactions together:: > + > + ntp = xfs_trans_dup(tp); > + xfs_trans_commit(tp); > + xfs_log_reserve(ntp); > + > +This results in a series of "rolling transactions" where the inode is locked > +across the entire chain of transactions. Hence while this series of rolling > +transactions is running, nothing else can read from or write to the inode and > +this provides a mechanism for complex changes to appear atomic from an external > +observer's point of view. > + > +It is important to note that a series of rolling transactions in a permanent > +transaction does not form an atomic change in the journal. While each > +individual modification is atomic, the chain is *not atomic*. If we crash half > +way through, then recovery will only replay up to the last transactional > +modification the loop made that was committed to the journal. > + > +This affects long running permanent transactions in that it is not possible to > +predict how much of a long running operation will actually be recovered because > +there is no guarantee of how much of the operation reached stale storage. Hence > +if a long running operation requires multiple transactions to fully complete, > +the high level operation must use intents and deferred operations to guarantee > +recovery can complete the operation once the first transactions is persisted in > +the on-disk journal. > + > + > +Transactions are Asynchronous > +============================= > + > +In XFS, all high level transactions are asynchronous by default. This means that > +xfs_trans_commit() does not guarantee that the modification has been committed > +to stable storage when it returns. Hence when a system crashes, not all the > +completed transactions will be replayed during recovery. > + > +However, the logging subsystem does provide global ordering guarantees, such > +that if a specific change is seen after recovery, all metadata modifications > +that were committed prior to that change will also be seen. > + > +For single shot operations that need to reach stable storage immediately, or > +ensuring that a long running permanent transaction is fully committed once it is > +complete, we can explicitly tag a transaction as synchronous. This will trigger > +a "log force" to flush the outstanding committed transactions to stable storage > +in the journal and wait for that to complete. > + > +Synchronous transactions are rarely used, however, because they limit logging > +throughput to the IO latency limitations of the underlying storage. Instead, we > +tend to use log forces to ensure modifications are on stable storage only when > +a user operation requires a synchronisation point to occur (e.g. fsync). > + > + > +Transaction Reservations > +======================== > + > +It has been mentioned a number of times now that the logging subsystem needs to > +provide a forwards progress guarantee so that no modification ever stalls > +because it can't be written to the journal due to a lack of space in the > +journal. This is achieved by the transaction reservations that are made when > +a transaction is first allocated. For permanent transactions, these reservations > +are maintained as part of the transaction rolling mechanism. > + > +A transaction reservation provides a guarantee that there is physical log space > +available to write the modification into the journal before we start making > +modifications to objects and items. As such, the reservation needs to be large > +enough to take into account the amount of metadata that the change might need to > +log in the worst case. This means that if we are modifying a btree in the > +transaction, we have to reserve enough space to record a full leaf-to-root split > +of the btree. As such, the reservations are quite complex because we have to > +take into account all the hidden changes that might occur. > + > +For example, a user data extent allocation involves allocating an extent from > +free space, which modifies the free space trees. That's two btrees. Inserting > +the extent into the inode's extent map might require a split of the extent map > +btree, which requires another allocation that can modify the free space trees > +again. Then we might have to update reverse mappings, which modifies yet > +another btree which might require more space. And so on. Hence the amount of > +metadata that a "simple" operation can modify can be quite large. > + > +This "worst case" calculation provides us with the static "unit reservation" > +for the transaction that is calculated at mount time. We must guarantee that the > +log has this much space available before the transaction is allowed to proceed > +so that when we come to write the dirty metadata into the log we don't run out > +of log space half way through the write. > + > +For one-shot transactions, a single unit space reservation is all that is > +required for the transaction to proceed. For permanent transactions, however, we > +also have a "log count" that affects the size of the reservation that is to be > +made. > + > +While a permanent transaction can get by with a single unit of space > +reservation, it is somewhat inefficient to do this as it requires the > +transaction rolling mechanism to re-reserve space on every transaction roll. We > +know from the implementation of the permanent transactions how many transaction > +rolls are likely for the common modifications that need to be made. > + > +For example, and inode allocation is typically two transactions - one to > +physically allocate a free inode chunk on disk, and another to allocate an inode > +from an inode chunk that has free inodes in it. Hence for an inode allocation > +transaction, we might set the reservation log count to a value of 2 to indicate > +that the common/fast path transaction will commit two linked transactions in a > +chain. Each time a permanent transaction rolls, it consumes an entire unit > +reservation. > + > +Hence when the permanent transaction is first allocated, the log space > +reservation is increases from a single unit reservation to multiple unit > +reservations. That multiple is defined by the reservation log count, and this > +means we can roll the transaction multiple times before we have to re-reserve > +log space when we roll the transaction. This ensures that the common > +modifications we make only need to reserve log space once. > + > +If the log count for a permanent transaction reaches zero, then it needs to > +re-reserve physical space in the log. This is somewhat complex, and requires > +an understanding of how the log accounts for space that has been reserved. > + > + > +Log Space Accounting > +==================== > + > +The position in the log is typically referred to as a Log Sequence Number (LSN). > +The log is circular, so the positions in the log are defined by the combination > +of a cycle number - the number of times the log has been overwritten - and the > +offset into the log. A LSN carries the cycle in the upper 32 bits and the > +offset in the lower 32 bits. The offset is in units of "basic blocks" (512 > +bytes). Hence we can do realtively simple LSN based math to keep track of > +available space in the log. > + > +Log space accounting is done via a pair of constructs called "grant heads". The > +position of the grant heads is an absolute value, so the amount of space > +available in the log is defined by the distance between the position of the > +grant head and the current log tail. That is, how much space can be > +reserved/consumed before the grant heads would fully wrap the log and overtake > +the tail position. > + > +The first grant head is the "reserve" head. This tracks the byte count of the > +reservations currently held by active transactions. It is a purely in-memory > +accounting of the space reservation and, as such, actually tracks byte offsets > +into the log rather than basic blocks. Hence it technically isn't using LSNs to > +represent the log position, but it is still treated like a split {cycle,offset} > +tuple for the purposes of tracking reservation space. > + > +The reserve grant head is used to accurately account for exact transaction > +reservations amounts and the exact byte count that modifications actually make > +and need to write into the log. The reserve head is used to prevent new > +transactions from taking new reservations when the head reaches the current > +tail. It will block new reservations in a FIFO queue and as the log tail moves > +forward it will wake them in order once sufficient space is available. This FIFO > +mechanism ensures no transaction is starved of resources when log space > +shortages occur. > + > +The other grant head is the "write" head. Unlike the reserve head, this grant > +head contains an LSN and it tracks the physical space usage in the log. While > +this might sound like it is accounting the same state as the reserve grant head > +- and it mostly does track exactly the same location as the reserve grant head - > +there are critical differences in behaviour between them that provides the > +forwards progress guarantees that rolling permanent transactions require. > + > +These differences when a permanent transaction is rolled and the internal "log > +count" reaches zero and the initial set of unit reservations have been > +exhausted. At this point, we still require a log space reservation to continue > +the next transaction in the sequeunce, but we have none remaining. We cannot > +sleep during the transaction commit process waiting for new log space to become > +available, as we may end up on the end of the FIFO queue and the items we have > +locked while we sleep could end up pinning the tail of the log before there is > +enough free space in the log to fulfil all of the pending reservations and > +then wake up transaction commit in progress. > + > +To take a new reservation without sleeping requires us to be able to take a > +reservation even if there is no reservation space currently available. That is, > +we need to be able to *overcommit* the log reservation space. As has already > +been detailed, we cannot overcommit physical log space. However, the reserve > +grant head does not track physical space - it only accounts for the amount of > +reservations we currently have outstanding. Hence if the reserve head passes > +over the tail of the log all it means is that new reservations will be throttled > +immediately and remain throttled until the log tail is moved forward far enough > +to remove the overcommit and start taking new reservations. In other words, we > +can overcommit the reserve head without violating the physical log head and tail > +rules. > + > +As a result, permanent transactions only "regrant" reservation space during > +xfs_trans_commit() calls, while the physical log space reservation - tracked by > +the write head - is then reserved separately by a call to xfs_log_reserve() > +after the commit completes. Once the commit completes, we can sleep waiting for > +physical log space to be reserved from the write grant head, but only if one > +critical rule has been observed:: > + > + Code using permanent reservations must always log the items they hold > + locked across each transaction they roll in the chain. > + > +"Re-logging" the locked items on every transaction roll ensures that the items > +the transaction chain is rolling are always relocated to the physical head of This reads (to me) a little awkwardly. One could ask if the transaction chain itself is rolling the items? Which is not really what's happening. How about: "...ensures that the items attached to the transaction chain being rolled are always relocated..." > +the log and so do not pin the tail of the log. If a locked item pins the tail of > +the log when we sleep on the write reservation, then we will deadlock the log as > +we cannot take the locks needed to write back that item and move the tail of the > +log forwards to free up write grant space. Re-logging the locked items avoids > +this deadlock and guarantees that the log reservation we are making cannot > +self-deadlock. > + > +If all rolling transactions obey this rule, then they can all make forwards > +progress independently because nothing will block the progress of the log > +tail moving forwards and hence ensuring that write grant space is always > +(eventually) made available to permanent transactions no matter how many times > +they roll. > + > + > +Re-logging Explained > +==================== > + > +XFS allows multiple separate modifications to a single object to be carried in > +the log at any given time. This allows the log to avoid needing to flush each > +change to disk before recording a new change to the object. XFS does this via a > +method called "re-logging". Conceptually, this is quite simple - all it requires > +is that any new change to the object is recorded with a *new copy* of all the > +existing changes in the new transaction that is written to the log. > > That is, if we have a sequence of changes A through to F, and the object was > written to disk after change D, we would see in the log the following series > @@ -42,16 +327,13 @@ transaction:: > In other words, each time an object is relogged, the new transaction contains > the aggregation of all the previous changes currently held only in the log. > > -This relogging technique also allows objects to be moved forward in the log so > -that an object being relogged does not prevent the tail of the log from ever > -moving forward. This can be seen in the table above by the changing > -(increasing) LSN of each subsequent transaction - the LSN is effectively a > -direct encoding of the location in the log of the transaction. > +This relogging technique allows objects to be moved forward in the log so that > +an object being relogged does not prevent the tail of the log from ever moving > +forward. This can be seen in the table above by the changing (increasing) LSN > +of each subsequent transaction, and it's the technique that allows us to > +implement long-running, multiple-commit permanent transactions. > > -This relogging is also used to implement long-running, multiple-commit > -transactions. These transaction are known as rolling transactions, and require > -a special log reservation known as a permanent transaction reservation. A > -typical example of a rolling transaction is the removal of extents from an > +A typical example of a rolling transaction is the removal of extents from an > inode which can only be done at a rate of two extents per transaction because Ignoring rt files, do we even have /that/ limit anymore? Especially considering the other patchset you just sent... :) With that one odd sentence up there reworked, Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> --D > of reservation size limitations. Hence a rolling extent removal transaction > keeps relogging the inode and btree buffers as they get modified in each > @@ -67,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get > dirtier as they get relogged, so each subsequent transaction is writing more > metadata into the log. > > -Another feature of the XFS transaction subsystem is that most transactions are > -asynchronous. That is, they don't commit to disk until either a log buffer is > -filled (a log buffer can hold multiple transactions) or a synchronous operation > -forces the log buffers holding the transactions to disk. This means that XFS is > -doing aggregation of transactions in memory - batching them, if you like - to > -minimise the impact of the log IO on transaction throughput. > +It should now also be obvious how relogging and asynchronous transactions go > +hand in hand. That is, transactions don't get written to the physical journal > +until either a log buffer is filled (a log buffer can hold multiple > +transactions) or a synchronous operation forces the log buffers holding the > +transactions to disk. This means that XFS is doing aggregation of transactions > +in memory - batching them, if you like - to minimise the impact of the log IO on > +transaction throughput. > > The limitation on asynchronous transaction throughput is the number and size of > log buffers made available by the log manager. By default there are 8 log > -- > 2.31.1 >