On Fri, Apr 28, 2017 at 12:40:32PM -0700, Darrick J. Wong wrote: > On Thu, Apr 27, 2017 at 12:35:19AM -0700, Christoph Hellwig wrote: > > On Wed, Apr 26, 2017 at 02:37:31PM -0700, Darrick J. Wong wrote: > > > That's so shockingly obvious I don't know why it didn't occur to me. > > > Uh, I'll go give that a try with that horrid generic/931 testcase that > > > I sent to the list a couple of weeks ago. :) > > ...and now I've realized why I never thought of it -- it's unfortunately > racy. We want to constrain the number of blocks that we unmap from the > file based on the refcount, but because the transaction rolling unlocks > the AGF in between the unmap and decrement-refcount transactions, > another process can swoop in and make changes to the sharedness that > screw us over. Let's say files A and B both share the same 1000 blocks, > and threads A and B are modifying files A and B, respectively. > > Thread A: Thread B: > 1. punch blocks backed by (0-1000) 1. punch blocks backed by (500-600) > 2. grab agf to read refcbt > 3. unmap blocks 0-1000; no > refcount edges so we can > unmap everything at once > 4. roll transaction, unlock agf > 2. grab agf to read refcbt > 3. unmap blocks 500-600; no refcount > edges so we can unmap everything in > a single operation > 4. roll transaction, unlock agf > 5. grab agf again. lucky! > 6. split refcbt record because 500-600 > is now only owned by file A. > 7. complete transaction, unlock agf > 5. grab agf to read refcbt > 6. whoah! why are there two > edges in the refcount tree at > 500 and 600??? > Ugh, ok. So we basically can't make any assumptions about shared extent state across the transactions, because that can change as soon as we start processing deferred actions. This is presumably why we have high level refcount inc/dec deferred ops such that the specific shared state of a range of blocks isn't really assessed until the deferred op runs. Therefore, this patch limits the high level unmap operation based on a worst case assumption that every other block is shared..? If we do end up going with this approach, I think the comment where we calc the max unmap block count could be slightly improved to point out the specifics of the issue here. E.g., each individual block of a contiguous extent could result in an independent EFI or reflink adjustment, therefore we need to convert available log reservation into a block based unmap limit. > So either we have to keep the AGF buffer locked and attached across > all the transaction rolls until we around to processing the refcount > decrement deferred op, or figure out something else clever...? > Keeping the AGF locked still sounds like the most preferable approach to me, provided there are no other dragons lurking around locking or whatnot and we can do so cleanly. Perhaps we could plumb in a mechanism to xfs_trans_bhold() a buffer across a dop list and then _bhold_release() it for the last transaction returned by xfs_defer_finish()..? Otherwise, another question I had more related to the approach of the current patch... could we use a deterministic unmap length baked into the transaction rather than a dynamic assumption that a certain percentage of the transaction is designated for reflink adjustments? E.g., use a transaction that is explicitly designed for a max extent modification count or a max block adjustment (based on the worst link reflink requirements) and fix xfs_bunmapi() to use that max count. What concerns me about the dynamic approach is that if we ever had to increase or modify this transaction res for something else that happens to also affect unmap (*handwave*), we'd have to also know to go and possibly reverse engineer/adjust this reflink heuristic appropriately (or more likely, not do so and set a landmine). Brian > --D > > > I like the idea too. But once thing to remember is that freeing > > blocks from two AG in the same transaction is deadlock prone as the > > other thread shows. I've been doing some work to mitigate that, but > > I guess I'll just wait for your new patch and will rebase on top of > > that. > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html