On Tue, Jan 05, 2016 at 07:42:26AM -0500, Brian Foster wrote:
> On Mon, Jan 04, 2016 at 03:59:51PM -0800, Darrick J. Wong wrote:
> > On Sun, Dec 20, 2015 at 09:02:54AM -0500, Brian Foster wrote:
> > > On Sat, Dec 19, 2015 at 12:56:23AM -0800, Darrick J. Wong wrote:
> > > > Hi all,
> > > >
> > > ...
> > > > Fixed since RFCv3:
> > > >
> > > > * The reflink and dedupe ioctls are being hoisted to the VFS, as
> > > >   provided in the first few patches.  Patch 81 connects to this
> > > >   functionality.
> > > >
> > > > * Copy on write has been rewritten for v4.  We now use the
> > > >   existing delayed allocation mechanism to coalesce writes
> > > >   together, deferring allocation until writeout time.  This
> > > >   enables CoW to make better block placement decisions and
> > > >   significantly reduces overhead.  CoW is still pretty slow, but
> > > >   not as slow as before.
> > > >
> > > > * Direct IO CoW has been implemented using the same mechanism as
> > > >   above, but modified to perform the allocation and remapping
> > > >   right then and there.  Throughput is much higher than pushing
> > > >   data through the page cache CoW.  (It's the same mechanism, but
> > > >   we're playing with chunks bigger than a single memory page.)
> > > >
> > > > * CoW ENOSPC works correctly now, except in the pathological case
> > > >   that the AG fills up and the rmap btree cannot expand.  That
> > > >   will be addressed for v5.
> > > >
> > > > * fallocate will now unshare blocks to prevent future ENOSPC, as
> > > >   you'd expect.
> > > >
> > > > * refcount btree blocks are preallocated at mount time to prevent
> > > >   ENOSPC while trying to expand the tree.  This also has the
> > > >   effect of grouping the btree blocks together, which can speed
> > > >   up CoW remapping.
> > >
> > > Can you elaborate on how these blocks are preallocated?  E.g., is
> > > the tree "preconstructed" in some sense?  However that is done, is
> > > this the anticipated solution or a temporary workaround?
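[Aside: the clone ioctl being hoisted to the VFS above appears to be the
interface now exposed to userspace as FICLONE in <linux/fs.h>.  A minimal
sketch of driving it; the clone_file() helper name is hypothetical and
error handling is trimmed for brevity:]

```c
/* Sketch: clone (reflink) src into dst via the VFS clone ioctl.
 * Requires a reflink-capable filesystem; FICLONE is the name the
 * interface eventually took in <linux/fs.h>. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */

int clone_file(const char *src, const char *dst)
{
	int sfd = open(src, O_RDONLY);
	int dfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int ret = -1;

	if (sfd >= 0 && dfd >= 0)
		ret = ioctl(dfd, FICLONE, sfd); /* share extents, no data copy */
	if (ret)
		perror("FICLONE");
	if (sfd >= 0)
		close(sfd);
	if (dfd >= 0)
		close(dfd);
	return ret;
}
```

On a filesystem without reflink support the ioctl fails with EOPNOTSUPP;
where it is supported, the two files share extents afterwards and diverge
only when one of them is written.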
> > > Also, shouldn't the ENOSPC condition be handled by the AGFL?  I
> > > take it there is something going on here that renders that
> > > solution flawed, so I'm just curious what it is.
> > >
> > > (Sorry if this is all explained elsewhere, but I haven't yet had a
> > > chance to take a close enough look at this feature.)
> >
> > Reference count btree blocks aren't allocated from the AGFL; they're
> > allocated from the free space in the same manner as the inobt, per a
> > review comment from Dave a looong time ago. :)
>
> Ah, Ok.
>
> > As such, we can get ourselves into the nasty situation where every
> > block in the AG has been allocated to file data.  If we then see a
> > bunch of reference count changes that are scattered around the AG,
> > the reference count btree has to expand to hold all the new
> > records... but there isn't space, and the operation fails.  Given
> > that we know the maximum possible size of the refcount btree (it's
> > 0.3% of the AG size with 4k blocks), I figured it was easy enough to
> > avoid ENOSPC for reflink operations.
>
> Sounds reasonable.
>
> > I've temporarily fixed this by adding code that figures out how many
> > blocks we need if the reference count btree has to have a unique
> > record for every block in the AG, and holding that many blocks until
> > either they're allocated to the refcount btree or freed at umount
> > time.  Right now it's a temporary fix (if the FS crashes, the
> > reserved blocks are lost), but it wouldn't be difficult for the FS
> > to make a permanent reservation that's recorded on disk somehow.
> > That involves writing things to disk and making xfsprogs understand
> > the reservation, though; let's see what people say about the
> > reserved pool idea at all.
> >
> > Does that make sense? :)
>
> Yep, it sounds sort of like the reserve pool mechanism used to protect
> against ENOSPC when freeing blocks.  Curious... why are the reserved
> blocks lost on fs crash?  Wouldn't they be reserved again on the
> subsequent mount?

They will, but the pre-crash reservation isn't (yet) written down
anywhere on disk.

Thank /you/ for having a look at the reflink code! :)

--D

> Thanks for the explanation...
>
> Brian
>
> > --D
> >
> > > Brian
> > >
> > > > Issues:
> > > >
> > > > * The extent swapping ioctl still allocates a bigger fixed-size
> > > >   transaction.  That's most likely a stupid thing to do, so
> > > >   getting a better grip on how the journalling code works and
> > > >   auditing all the new transaction users will have to happen.
> > > >   Right now it mostly gets lucky.
> > > >
> > > > * EFI tracking for the allocated-but-not-yet-mapped blocks is
> > > >   nonexistent.  A crash will leak them.
> > > >
> > > > * ENOSPC while expanding the rmap btree can crash the FS.  For
> > > >   now we work around this problem by making the AGFL as big as
> > > >   possible, failing CoW attempts with ENOSPC if there aren't
> > > >   enough AGFL blocks available, and hoping that doesn't actually
> > > >   happen.
> > > >
> > > > If you're going to start using this mess, you probably ought to
> > > > just pull from my github trees for kernel[1], xfsprogs[2], and
> > > > xfstests[3].  There are also updates for xfs-docs[4] and
> > > > man-pages[5].
> > > >
> > > > The patches have been xfstested with x64, i386, and ppc64; while
> > > > in general the tests run to completion, there are still periodic
> > > > bugs that will be addressed by the next RFC.  There's a
> > > > persistent crash on arm64 and ppc64el that I haven't been able
> > > > to triage.
> > > >
> > > > This is an extraordinary way to eat your data.  Enjoy!
> > > > Comments and questions are, as always, welcome.
> > > >
> > > > --D
> > > >
> > > > [1] https://github.com/djwong/linux/tree/for-dave
> > > > [2] https://github.com/djwong/xfsprogs/tree/for-dave
> > > > [3] https://github.com/djwong/xfstests/tree/for-dave
> > > > [4] https://github.com/djwong/xfs-documentation/tree/for-dave
> > > > [5] https://github.com/djwong/man-pages/commits/for-mtk

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
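[Appendix: a back-of-the-envelope check of the "maximum possible size of
the refcount btree is 0.3% of the AG with 4k blocks" figure from the
thread above, taking one refcount record per AG block as the worst case.
The on-disk constants used here — 12-byte records, 4-byte keys and
pointers, a 56-byte short-form btree block header — are my assumptions
about the v5 format, not numbers taken from the patches:]

```c
/* Worst-case refcount btree size: one record per AG block, 4k blocks.
 * The constants below are assumed v5 on-disk format values. */

#define BLOCKSIZE 4096ul
#define BT_HDR      56ul  /* short-form CRC-enabled btree block header */
#define REC_SIZE    12ul  /* rc_startblock, rc_blockcount, rc_refcount */
#define KEYPTR       8ul  /* 4-byte key + 4-byte block pointer */

unsigned long refcountbt_worst_case(unsigned long agblocks)
{
	unsigned long leaf_recs = (BLOCKSIZE - BT_HDR) / REC_SIZE; /* 336 */
	unsigned long node_ptrs = (BLOCKSIZE - BT_HDR) / KEYPTR;   /* 505 */
	unsigned long blocks = (agblocks + leaf_recs - 1) / leaf_recs;
	unsigned long total = blocks;

	/* Walk up the tree, adding internal levels until we reach a root. */
	while (blocks > 1) {
		blocks = (blocks + node_ptrs - 1) / node_ptrs;
		total += blocks;
	}
	return total;
}
```

For a 4 GiB AG (2^20 blocks of 4k) this works out to 3,129 btree blocks,
just under 0.3% of the AG — consistent with the figure quoted above.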