On Thu, May 07, 2020 at 08:09:56AM -0700, Darrick J. Wong wrote:
> On Thu, May 07, 2020 at 05:53:06AM -0400, Brian Foster wrote:
> > On Wed, May 06, 2020 at 10:01:50AM -0700, Darrick J. Wong wrote:
> > > On Wed, May 06, 2020 at 09:56:56AM -0400, Brian Foster wrote:
> > > > On Tue, May 05, 2020 at 05:34:23PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, May 05, 2020 at 10:11:38AM -0400, Brian Foster wrote:
> > > > > > On Mon, May 04, 2020 at 06:13:53PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > >
> > > > > > > In xfs_bui_item_recover, there exists a use-after-free bug with regard
> > > > > > > to the inode that is involved in the bmap replay operation.  If the
> > > > > > > mapping operation does not complete, we call xfs_bmap_unmap_extent to
> > > > > > > create a deferred op to finish the unmapping work, and we retain a
> > > > > > > pointer to the incore inode.
> > > > > > >
> > > > > > > Unfortunately, the very next thing we do is commit the transaction and
> > > > > > > drop the inode.  If reclaim tears down the inode before we try to
> > > > > > > finish the defer ops, we dereference garbage and blow up.  Therefore,
> > > > > > > create a way to join inodes to the defer ops freezer so that we can
> > > > > > > maintain the xfs_inode reference until we're done with the inode.
> > > > > > >
> > > > > > > Note: This imposes the requirement that there be enough memory to keep
> > > > > > > every incore inode in memory throughout recovery.
> > > > > > >
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > > > > ---
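
(To restate the bug in code: before this patch, the tail of
xfs_bui_item_recover() went roughly like this, per the removal hunk
further down --

        error = xlog_recover_trans_commit(tp, dffp);    /* dfops frozen into *dffp */
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        xfs_irele(ip);                  /* may drop the last reference */
        return error;

-- while the frozen dfops, e.g. a requeued unmap from a partial bunmapi,
still hold the bare ip pointer.  If reclaim tears the inode down first,
finishing those dfops dereferences freed memory.)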

> > > > > > Maybe I'm missing something, but I thought the discussion on the
> > > > > > previous version[1] landed on an approach where the intent would hold a
> > > > > > reference to the inode.  Wouldn't that break the dependency on the dfops
> > > > > > freeze/thaw mechanism?
> > > > >
> > > > > We did, but as I started pondering how to make this work in practice,
> > > > > I soon realized that it's not as simple as "don't irele the inode":
> > > > >
> > > > > (Part 1)
> > > > >
> > > > > It would be easy enough to add a flag to struct xfs_bmap_intent that
> > > > > means "when you're done, unlock and irele the inode".  You'd have to add
> > > > > this capability to any log item that deals with inodes, but only
> > > > > deferred bmap items need this.
> > > > >
> > > > > Log recovery looks like this:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <end>
> > > > >
> > > > > What happens if the first bunmapi call doesn't actually unmap
> > > > > everything?  We'll queue yet another bmap intent item (call it BUI2) to
> > > > > finish the work.  This means that we can't just drop the inode when
> > > > > we're done processing BUI1; we have to propagate the "unlock and irele"
> > > > > flag to BUI2.  Ok, so now there's this chaining behavior that wasn't
> > > > > there before.
> > > > >
> > > > > Then I thought about it some more, and wondered what would happen if the
> > > > > log contained two unfinished bmap items for the same inode?  If this
> > > > > happened, you'd deadlock because you've not released the ILOCK for the
> > > > > first unfinished bmap item when you want to acquire the ILOCK for the
> > > > > second unfinished bmap item.
> > > > >
> > > >
> > > > Hm, Ok.. so IIUC the lock is implicitly held across these map/unmap
> > > > dfops at runtime, but that's not necessarily the case during recovery
> > > > due to the dfops chaining behavior.  To me, this raises the question of
> > > > why we need to reorder/split these particular intents during recovery if
> > > > the original operation would have completed under a single inode lock
> > > > cycle.
> > >
> > > Log recovery /did/ function that way (no reordering) until
> > > 509955823cc9c, when it was discovered that we don't have a good way to
> > > reconstruct the dfops chain from the list of recovered log intent items.
> > >
> > > > The comments in xfs_log_recover.c don't really explain it.  From going
> > > > back to commit 509955823cc9c ("xfs: log recovery should replay deferred
> > > > ops in order"), it looks like the issue is that ordering gets screwed up
> > > > for operations that commit multiple intents per transaction and whose
> > > > completion creates new child intents.  At runtime, we'd process the
> > > > intents in the current transaction first and then get to the child
> > > > intents in the order they were created.  If we didn't have the
> > > > recovery-time dfops shuffling implemented in this patch, recovery would
> > > > process the first intent in the log, then all of the child intents of
> > > > that one before getting to the next intent in the log that presumably
> > > > would have been second at runtime.  Am I following that correctly?
> > >
> > > Correct.  If we could figure out how to rebuild the dfops chain (and
> > > thus the ordering requirements), we would be able to eliminate this
> > > freezer stuff entirely.
> > >
> >
> > Ok, thanks.  I was wondering the same, but it's not clear how to do that
> > with just the intent information in the log and big checkpoint
> > transactions.  Perhaps we could if we had a transactional id that was
> > common across intents in a single transaction, or otherwise unique
> > intent ids that would allow an intent to refer to the next in a chain at
> > the time the intent was logged (such that recovery could learn to
> > construct and process chains instead of just individual intents).  I'm
> > curious if we're going to want/need to solve that issue somehow or
> > another more directly eventually as we grow more intents and perhaps
> > come up with new and more complex intent-based operations (with more
> > complex recovery scenarios).
>
> <nod> Too bad that's a log format change. :(
>

Yeah.  It might be worth considering if we have an opportunity to break
format in the future, but not useful for the current code.
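
If we ever do get that chance, even something as small as stamping a
chain id into each intent might be enough for recovery to regroup them.
Purely hypothetical sketch (the chain id field is invented, this is
obviously not the current on-disk format):

        /* hypothetical: the BUI log format with a chain id bolted on */
        struct xfs_bui_log_format {
                uint16_t                bui_type;       /* bui log item type */
                uint16_t                bui_size;       /* size of this item */
                uint32_t                bui_nextents;   /* # extents to bmap */
                uint64_t                bui_id;         /* bui identifier */
                uint64_t                bui_chain_id;   /* hypothetical: shared
                                                           by every intent logged
                                                           by one dfops chain */
                struct xfs_map_extent   bui_fmt[];      /* array of extents */
        };

Recovery could then bucket recovered intents by bui_chain_id and replay
each chain in its original order instead of guessing.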

> > > > I think the rest makes sense from the perspective of not wanting to
> > > > plumb the conditional locking complexity through the dfops/intent
> > > > processing for every intent type that eventually needs it, as opposed to
> > > > containing it in the log recovery code (via dfops magic).  I need to
> > > > think about all that some more, but I'd like to make sure I understand
> > > > why we have this recovery requirement in the first place.
> > >
> > > <nod> It took me a few hours to figure out why that patch was correct
> > > the first time I saw it.  It took me a few hours again(!) when I was
> > > trying to write this series. :/
> >
> > If I follow correctly, the rmap swapext sequence makes for a good
> > example because the bmap completions generate the rmap intents.  A
> > simplified (i.e. single extent remap) runtime transactional flow for
> > that looks something like the following:
> >
> > t1: unmap intent, map intent
> > t2: unmap done, map intent, runmap intent
> > t3: map done, runmap intent, rmap intent
> > t4: runmap done, rmap intent
>
> Work items that aren't touched in a particular transaction do not get
> relogged, so this sequence is:
>
> t1: unmap(0) intent, map(1) intent
> t2: unmap(0) done, runmap(2) intent
> t3: map(1) done, rmap(3) intent
> t4: runmap(2) done
> t5: rmap(3) done
>
> (Maybe I need to take another look at the autorelogging patchset.)
>

Ah, I see.  Thanks.

Brian

> > ...
> >
> > So at runtime these operations complete in the obvious linear order.
> > Intents drop off the front as done items are logged and new intents are
> > tacked on the end.  If we crash between t2 and t3, however, we see t2 in
> > the log and would end up completing the map intent and the subsequently
> > generated rmap intent (that pops up in t3 at runtime) before we process
> > the runmap intent that's still in the log (and was originally in the
> > dfops queue for t2).  I _think_ that's essentially the scenario described
> > in 509955823cc9c,
>
> Yes...
>
> > but it's not clear to me if it's the same runtime example..
>
> ...and you arrived at it with a different scenario. :)
>
> --D
>
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > > You'd have to drop the ILOCK (but retain the inode reference) between
> > > > > the two bmap items.  I think this is currently impossible because there
> > > > > aren't any transactions that queue multiple bmap intents per
> > > > > transaction, nobody else holds the ilock, and the existing rmap
> > > > > implementation of the swapext ioctl doesn't allow swapping a file with
> > > > > itself.  However...
> > > > >
> > > > > (Part 2)
> > > > >
> > > > > The second thing I thought about is, what if there's a lot of work after
> > > > > transaction 0?  Say the recovery stream looks like:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > BUI1 unmap file1, offset1, startblock1, length == 200000
> > > > > ----------------
> > > > > <10000 more intents applied to other things>
> > > > > ----------------
> > > > > <end>
> > > > >
> > > > > This is a problem, because we must drop the ILOCK after completing the
> > > > > BUI1 work so that we can process the 10000 more intents before we can
> > > > > get to BUI2.  We cannot let the tail of the log get pinned on the locked
> > > > > inode, so we must release the ILOCK after we're done with BUI1 and
> > > > > re-acquire it when we're ready to deal with BUI2.
> > > > >
> > > > > This means I need /two/ flags in struct xfs_bmap_intent -- one to say
> > > > > that we need to irele the inode at the end of processing, and another to
> > > > > say that the bmap item needs to grab the ILOCK.  If there's going to be
> > > > > a BUI2 as a result of recovering BUI1, the first flag must be propagated
> > > > > to BUI2, but the second flag must /not/ be propagated.
> > > > >
> > > > > A potential counterargument is that we crammed all 10000 intents into
> > > > > the log without pinning the log, so maybe we can hold the ILOCK.
> > > > > However....
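
(To make the asymmetry concrete, I picture the flag-based approach
looking something like this -- names invented, not from any posted
patch:

        #define XFS_BMAP_INTENT_IRELE   (1U << 0)       /* irele inode when done */
        #define XFS_BMAP_INTENT_ILOCK   (1U << 1)       /* grab the ILOCK first */

        /* requeueing BUI1's leftover work as BUI2: */
        new->bi_flags |= (old->bi_flags & XFS_BMAP_INTENT_IRELE);
        /* XFS_BMAP_INTENT_ILOCK deliberately not copied over */

...which is exactly the sort of per-item state you conclude below
doesn't belong in the ->finish_item code.)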
> > > > >
> > > > > (Part 3)
> > > > >
> > > > > Next, I considered how the swapext log item in the atomic file update
> > > > > patchset would interact with log recovery.  A couple of notes about the
> > > > > swapext log items:
> > > > >
> > > > > 1. They are careful not to create the kinds of bunmap operations that
> > > > > would cause the creation of /more/ BUIs.
> > > > >
> > > > > 2. Every time we log one extent's worth of unmap/remap operations (using
> > > > > deferred bmap log items, naturally) we log a done item for the original
> > > > > swapext intent item and immediately log another swapext intent for the
> > > > > work remaining.
> > > > >
> > > > > I came up with the following sequence of transactions to recover:
> > > > >
> > > > > ----------------
> > > > > Transaction 0
> > > > > SXI0 file1, offset1, file2, offset2, length == 10
> > > > > ----------------
> > > > > Transaction 1
> > > > > BUI1 unmap file1, offset1, startblock1, length == 2
> > > > > BUI2 unmap file2, offset2, startblock2, length == 2
> > > > > BUI3 map file1, offset1, startblock2, length == 2
> > > > > BUI4 map file2, offset2, startblock1, length == 2
> > > > > SXD done (SXI0)
> > > > > SXI5 file1, offset1 + 2, file2, offset2 + 2, length == 8
> > > > > ---------------
> > > > > <end of log>
> > > > >
> > > > > Recovery in this case will end up with BUI[1-4] and SXI5 that still need
> > > > > processing.  Each of the 5 intent items gets its own recovery
> > > > > transaction and dfops freezer chain.  BUI1, BUI3, and SXI5 will each
> > > > > require ilock'd access to file1, which means that each of them will have
> > > > > to have the ability to drop the ILOCK but retain the reference.  The
> > > > > same argument applies to file2 and BUI2, BUI4, and SXI5.  Note also that
> > > > > in our brave new world, file1 and file2 can be the same.
> > > > >
> > > > > For the swapext log item type this inode management is particularly
> > > > > acute because we're certain to have new SXIs created as a result of
> > > > > recovering an SXI.
> > > > >
> > > > > (End)
> > > > >
> > > > > In conclusion, we need a mechanism to drop both the inode lock and the
> > > > > inode reference.  Each log item type that works with inodes will have to
> > > > > set aside 2 flags somewhere in the incore extent structure, and provide
> > > > > more or less the same code to manage that state.  Right now that's just
> > > > > the bmap items, but then the swapext items and the deferred xattr items
> > > > > will have that same requirement when they get added.
> > > > >
> > > > > If we do that, the inode reference and ilock management diverges even
> > > > > further from regular operations.  Regular callers are expected to iget
> > > > > and ilock the inode and maintain that all the way to the end of
> > > > > xfs_trans_commit.  Log recovery, on the other hand, will add a bunch of
> > > > > state handling to some of the log items so that we can drop the ILOCK
> > > > > after the first item, and then drop the inode reference after finishing
> > > > > the last possible user of that inode in the revived dfops chain.
> > > > >
> > > > > On the other hand, keeping the inode management in the defer freezer
> > > > > code means that I keep all that complexity and state management out of
> > > > > the ->finish_item code.  When we go to finish the new incore intents
> > > > > that get created during recovery, the inode reference and ILOCK rules
> > > > > are the same as anywhere else -- the caller is required to grab the
> > > > > inode, the transaction, and the ilock (in that order).
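
(FWIW that ordering is the standard pattern everywhere else in xfs;
abbreviated sketch, error handling trimmed:

        error = xfs_iget(mp, NULL, ino, 0, 0, &ip);     /* take the reference */
        ...
        error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, 0, 0, 0, &tp);
        ...
        xfs_ilock(ip, XFS_ILOCK_EXCL);                  /* then the lock */
        xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
        /* ...modify the inode, queue deferred ops... */
        error = xfs_trans_commit(tp);   /* finishes dfops, unlocks ip */

So keeping recovered intents on those rules does seem simpler than
teaching every ->finish_item about conditional locking.)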
> > > > >
> > > > > To the extent that recovering log intent items is /not/ the same as
> > > > > running a normal transaction, all that weirdness is concentrated in
> > > > > xfs_log_recover.c and a single function in xfs_defer.c.  The desired
> > > > > behavior is the same across all the log intent item types, so I
> > > > > decided in favor of keeping the centralised behavior and (trying to)
> > > > > contain the wacky.  We don't need all that right away, but we will.
> > > > >
> > > > > Note that I did make it so that we retain the reference always, so
> > > > > there's no longer a need for irele/igrab cycles, which makes capturing
> > > > > the dfops chain less error prone.
> > > > >
> > > > > --D
> > > > >
> > > > > > Brian
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-xfs/20200429235818.GX6742@magnolia/
> > > > > >
> > > > > > >  fs/xfs/libxfs/xfs_defer.c |   50 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  fs/xfs/libxfs/xfs_defer.h |   10 +++++++++
> > > > > > >  fs/xfs/xfs_bmap_item.c    |    7 ++++--
> > > > > > >  fs/xfs/xfs_icache.c       |   19 +++++++++++++++++
> > > > > > >  4 files changed, 83 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > index ea4d28851bbd..72933fdafcb2 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.c
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.c
> > > > > > > @@ -16,6 +16,7 @@
> > > > > > >  #include "xfs_inode.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > >  #include "xfs_trace.h"
> > > > > > > +#include "xfs_icache.h"
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * Deferred Operations in XFS
> > > > > > > @@ -583,8 +584,19 @@ xfs_defer_thaw(
> > > > > > >  	struct xfs_defer_freezer	*dff,
> > > > > > >  	struct xfs_trans		*tp)
> > > > > > >  {
> > > > > > > +	int				i;
> > > > > > > +
> > > > > > >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> > > > > > >  
> > > > > > > +	/* Re-acquire the inode locks. */
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		dff->dff_ilocks[i] = XFS_ILOCK_EXCL;
> > > > > > > +		xfs_ilock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +	}
> > > > > > > +
> > > > > > >  	/* Add the dfops items to the transaction. */
> > > > > > >  	list_splice_init(&dff->dff_dfops, &tp->t_dfops);
> > > > > > >  	tp->t_flags |= dff->dff_tpflags;
> > > > > > > @@ -597,5 +609,43 @@ xfs_defer_freezer_finish(
> > > > > > >  	struct xfs_defer_freezer	*dff)
> > > > > > >  {
> > > > > > >  	xfs_defer_cancel_list(mp, &dff->dff_dfops);
> > > > > > > +	xfs_defer_freezer_irele(dff);
> > > > > > >  	kmem_free(dff);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Attach an inode to this deferred ops freezer.  Callers must hold ILOCK_EXCL,
> > > > > > > + * which will be dropped and reacquired when we're ready to thaw the frozen
> > > > > > > + * deferred ops.
> > > > > > > + */
> > > > > > > +int
> > > > > > > +xfs_defer_freezer_ijoin(
> > > > > > > +	struct xfs_defer_freezer	*dff,
> > > > > > > +	struct xfs_inode		*ip)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (dff->dff_inodes[i] == ip)
> > > > > > > +			goto out;
> > > > > > > +		if (dff->dff_inodes[i] == NULL)
> > > > > > > +			break;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (i == XFS_DEFER_FREEZER_INODES) {
> > > > > > > +		ASSERT(0);
> > > > > > > +		return -EFSCORRUPTED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Attach this inode to the freezer and drop its ILOCK because we
> > > > > > > +	 * assume the caller will need to allocate a transaction.
> > > > > > > +	 */
> > > > > > > +	dff->dff_inodes[i] = ip;
> > > > > > > +	dff->dff_ilocks[i] = 0;
> > > > > > > +out:
> > > > > > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > index 7ae05e10d750..0052a0313283 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_defer.h
> > > > > > > +++ b/fs/xfs/libxfs/xfs_defer.h
> > > > > > > @@ -76,6 +76,11 @@ struct xfs_defer_freezer {
> > > > > > >  	/* Deferred ops state saved from the transaction. */
> > > > > > >  	struct list_head	dff_dfops;
> > > > > > >  	unsigned int		dff_tpflags;
> > > > > > > +
> > > > > > > +	/* Inodes to hold when we want to finish the deferred work items. */
> > > > > > > +#define XFS_DEFER_FREEZER_INODES	2
> > > > > > > +	unsigned int		dff_ilocks[XFS_DEFER_FREEZER_INODES];
> > > > > > > +	struct xfs_inode	*dff_inodes[XFS_DEFER_FREEZER_INODES];
> > > > > > >  };
> > > > > > >  
> > > > > > >  /* Functions to freeze a chain of deferred operations for later. */
> > > > > > > @@ -83,5 +88,10 @@ int xfs_defer_freeze(struct xfs_trans *tp, struct xfs_defer_freezer **dffp);
> > > > > > >  void xfs_defer_thaw(struct xfs_defer_freezer *dff, struct xfs_trans *tp);
> > > > > > >  void xfs_defer_freezer_finish(struct xfs_mount *mp,
> > > > > > >  		struct xfs_defer_freezer *dff);
> > > > > > > +int xfs_defer_freezer_ijoin(struct xfs_defer_freezer *dff,
> > > > > > > +		struct xfs_inode *ip);
> > > > > > > +
> > > > > > > +/* These functions must be provided by the xfs implementation. */
> > > > > > > +void xfs_defer_freezer_irele(struct xfs_defer_freezer *dff);
> > > > > > >  
> > > > > > >  #endif	/* __XFS_DEFER_H__ */
> > > > > > > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > > > > > > index c733bdeeeb9b..bbce191d8fcd 100644
> > > > > > > --- a/fs/xfs/xfs_bmap_item.c
> > > > > > > +++ b/fs/xfs/xfs_bmap_item.c
> > > > > > > @@ -530,12 +530,13 @@ xfs_bui_item_recover(
> > > > > > >  	}
> > > > > > >  
> > > > > > >  	error = xlog_recover_trans_commit(tp, dffp);
> > > > > > > -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > > -	xfs_irele(ip);
> > > > > > > -	return error;
> > > > > > > +	if (error)
> > > > > > > +		goto err_rele;
> > > > > > > +	return xfs_defer_freezer_ijoin(*dffp, ip);
> > > > > > >  
> > > > > > >  err_inode:
> > > > > > >  	xfs_trans_cancel(tp);
> > > > > > > +err_rele:
> > > > > > >  	if (ip) {
> > > > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > >  		xfs_irele(ip);
> > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > > > index 17a0b86fe701..b96ddf5ff334 100644
> > > > > > > --- a/fs/xfs/xfs_icache.c
> > > > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > > > @@ -12,6 +12,7 @@
> > > > > > >  #include "xfs_sb.h"
> > > > > > >  #include "xfs_mount.h"
> > > > > > >  #include "xfs_inode.h"
> > > > > > > +#include "xfs_defer.h"
> > > > > > >  #include "xfs_trans.h"
> > > > > > >  #include "xfs_trans_priv.h"
> > > > > > >  #include "xfs_inode_item.h"
> > > > > > > @@ -1847,3 +1848,21 @@ xfs_start_block_reaping(
> > > > > > >  	xfs_queue_eofblocks(mp);
> > > > > > >  	xfs_queue_cowblocks(mp);
> > > > > > >  }
> > > > > > > +
> > > > > > > +/* Release all the inode resources attached to this freezer. */
> > > > > > > +void
> > > > > > > +xfs_defer_freezer_irele(
> > > > > > > +	struct xfs_defer_freezer	*dff)
> > > > > > > +{
> > > > > > > +	unsigned int			i;
> > > > > > > +
> > > > > > > +	for (i = 0; i < XFS_DEFER_FREEZER_INODES; i++) {
> > > > > > > +		if (!dff->dff_inodes[i])
> > > > > > > +			break;
> > > > > > > +
> > > > > > > +		if (dff->dff_ilocks[i])
> > > > > > > +			xfs_iunlock(dff->dff_inodes[i], dff->dff_ilocks[i]);
> > > > > > > +		xfs_irele(dff->dff_inodes[i]);
> > > > > > > +		dff->dff_inodes[i] = NULL;
> > > > > > > +	}
> > > > > > > +}
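
Also, to make sure I'm reading the lifecycle right, the flow for an
inode attached to a freezer is now roughly:

        /* in ->iop_recover, e.g. xfs_bui_item_recover(): */
        error = xlog_recover_trans_commit(tp, dffp);    /* dfops frozen into *dffp */
        ...
        return xfs_defer_freezer_ijoin(*dffp, ip);      /* stash ip, drop only the ILOCK */

        /* later, when recovery processes the frozen chains: */
        xfs_defer_thaw(dff, tp);                /* re-take ILOCK_EXCL on dff_inodes[] */
        ...                                     /* finish the spliced-in dfops */
        xfs_defer_freezer_finish(mp, dff);      /* cancel whatever is left, then
                                                   xfs_defer_freezer_irele():
                                                   iunlock (if still locked) + irele */

If that's accurate, the reference indeed can't disappear underneath the
frozen dfops, at the cost of pinning those inodes in memory until
recovery completes.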