Hi Folks, V4 of the iunlink log item patchset. Note that this has been rebased on the xfs-perag-conv-5.20 topic branch to avoid the horrible merge conflicts the previous version of this patchset had with the perag conversions. hence you'll need to pull this branch: git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-perag-conv-5.20 before applying this patchset to test it. -Dave. ----- Inode cluster buffer pinning by dirty inodes allows us to improve dirty inode tracking efficiency in the log by logging the inode cluster buffer as an ordered transaction. However, this brings with it some new issues, namely the order in which we lock inode cluster buffers. While the dirty buffer pinning doesn't currently result in nested cluster buffer locking in xfs_trans_log_inode() and so there are no current problems, future changes will need to nest cluster buffer locking in this context. That is, transactions that dirty and commit multiple inodes in a transaction will now need to locking multiple inode cluster buffers in each transaction (e.g. create, rename, etc). This introduces new lock ordering constraints in these operations. It also introduces lock ordering constraints between the AGI and inode cluster buffers as a result of allocation/freeing being serialised by the AGI buffer lock. And then there is unlinked inode list logging, which currently has no fixed order of inode cluster buffer locking. It's all a bit Wild West right now. Locking pure inode modifications in order is relatively easy. We don't actually need to attach and log the buffer to the transaction until the last moment. We have all the inodes locked, so nothing other than unlinked inode list modification can race with the transaction modifying inodes. Hence we can safely move the attachment of the inodes to the cluster buffer from when we first dirty them in xfs_trans_log_inode to just before we commit the transaction. At this point, all the inodes that have been dirtied in the transaction have already been locked, modified, logged and attached to the transaction. Hence if we add a hook into xfs_trans_commit() to run a "precommit" operation on these log items, we can use this operation to attach the inodes to the cluster buffer at commit time instead of in xfs_trans_log_inode(). This, by itself, doesn't solve the lock ordering problem. What it does do, however, is give us a place where we can -order- all the dirty items in the transaction list. Hence before we call the precommit operation on each log item, we sort them. This allows us to sort all the inode items so that the pre-commit functions that locks and logs the cluster buffers are run in a deterministic order. This solves the lock order problem for pure inode modifications. The unlinked inode list buffer locking is more complex. The unlinked list is unordered - we add to the tail, remove from where-ever the inode is in the list. Hence we might need to lock two inode buffers here (previous inode in list and the one being removed). While we can order the locking of these buffers correctly within the confines of the unlinked list, there may be other inodes that need buffer locking in the same transaction. e.g. O_TMPFILE being linked into a directory also modifies the directory inode. Hence we need a mechanism for defering unlinked inode list updates to the pre-commit operation where it can be sorted into the correct order. We can do this by first observing that we serialise unlinked list modifications by holding the AGI buffer lock. IOWs, the AGI is going to be locked until the transaction commits any time we modify the unlinked list. Hence it doesn't matter when in the unlink transactions that we actually load, lock and modify the inode cluster buffer. IOWs, what we need is an unlinked inode log item to defer the inode cluster buffer update to transaction commit time where it can be ordered with all the other inode cluster operations. Essentially all we need to do is record the inodes that need to have their unlinked list pointer updated in a new log item that we attached to the transaction. This log item exists purely for the purpose of delaying the update of the unlinked list pointer until the inode cluster buffer can be locked in the correct order around the other inode cluster buffers. It plays no part in the actual commit, and there's no change to anything that is written to the log. i.e. the inode cluster buffers still have to be fully logged here (not just ordered) as log recovery depedends on this to replay mods to the unlinked inode list. To make this unlinked inode list processing simpler and easier to implement as a log item, we need to change the way we track the unlinked list in memory. Starting from the observation that an inode on the unlinked list is pinned in memory by the VFS, we can use the xfs_inode itself to track the unlinked list. To do this efficiently, we want the unlinked list to be a double linked list. The problem here is that we need a list per AGI unlinked list, and there are 64 of these per AGI. The approach taken in this patchset is to shadow the AGI unlinked list heads in the perag, and link inodes by agino, hence requiring only 8 extra bytes per inode to track this state. We can then use the agino pointers for lockless inode cache lookups to retreive the inode. The aginos in the inode are modified only under the AGI lock, just like the cluster buffer pointers, so we don't need any extra locking here. The i_next_unlinked field tracks the on-disk value of the unlinked list, and the i_prev_unlinked is a purely in-memory pointer that enables us to efficiently remove inodes from the middle of the list. This results in moving a lot of the unlink modification work into the precommit operations on the unlink log item. tracking all the unlinked inodes in the inodes themselves also gets rid of the unlinked list reference hash table that is used to track this back pointer relationship. This greatly simplifies the the unlinked list modification code, and removes memory allocations in this hot path to track back pointers. This, overall, slightly reduces the CPU overhead of the unlink path. The result of this log item means that we move all the actual manipulation of objects to be logged out of the iunlink path and into the iunlink item. This allows for future optimisation of this mechanism without needing changes to high level unlink path, as well as making the unlink lock ordering predictable and synchronised with other operations that may require inode cluster locking. Changelog: Version 4: - rebase on 5.19-rc2 + xfs-perag-conv-5.20 as the changes to AGI interfaces cause complex merge conflicts that are non-trivial to resolve correctly. - fix typos in commit messages - fix random whitespace damage - clean up log recovery changes to avoid passing both mp and pag around - clean up error handling for xfs_iunlink_lookup - remove unused tracepoints - remove stale references to potential dquot modifications Version 3: - https://lore.kernel.org/linux-xfs/20220627004336.217366-1-david@xxxxxxxxxxxxx/ - rebase on 5.19-rc2 - Note: has significant, non-trivial merge conflicts with xfs-perag-conv-5.20. Should probably be rebased on that branch to avoid these. Need to know what order these branches are going to get merged first. Version 2: - unpublished - rebase on 5.15-rc2 + xlog-intent-whiteouts - reverts changes made in V1 that moved to a list_head based double linked list. This requires on-disk format changes to allow the AGI to use just a single bucket and so is not generically applicable to all XFS filesystem formats. This version goes back to using prev/next agino values and radix tree lookups as the original RFC used. Version 1: - https://lore.kernel.org/linux-xfs/20200812092556.2567285-1-david@xxxxxxxxxxxxx/ - split up into many smaller patches - includes Xiang's single unlinked list bucket modification - uses a list_head for the in memory double unlinked inode list rather than aginos and lockless inode lookups - much simpler as it doesn't need to look up inodes from agino values - iunlink log item changed to take an xfs_inode pointer rather than an imap and agino values - a handful of small cleanups that breaking up into small patches allowed. [RFC] - https://lore.kernel.org/linux-xfs/20200623095015.1934171-1-david@xxxxxxxxxxxxx/