Re: [RFC] Preparing for XFS reflink D-day

Amir Goldstein <amir73il@xxxxxxxxx> · Sun, 11 Dec 2016 21:23:38 +0200

On Sun, Dec 11, 2016 at 8:27 PM, Darrick J. Wong
<darrick.wong@xxxxxxxxxx> wrote:
> On Sun, Dec 11, 2016 at 10:38:21AM +0200, Amir Goldstein wrote:
>> On Sat, Dec 10, 2016 at 9:42 PM, Darrick J. Wong
>> <darrick.wong@xxxxxxxxxx> wrote:
>> > On Sat, Dec 10, 2016 at 10:04:39AM +0200, Amir Goldstein wrote:
>> ...
>> >
>> >> I realize that rmapbt/reflink features are declared unstable and
>> >> bugs could certainly be lurking without doing any reflinks at all.
>> >> However, I estimate that the class of bugs introduced by heavily
>> >> reflinked file systems is going to take more time to tame.
>> >
>> > Yes, probably.  It seems reasonably stable on a young FS, though we'll
>> > see how gracefully it ages.  There are probably mistakes in the ENOSPC
>> > handling since that seems to be everyone's Achilles heel.
>> >
>>
>> So we seem to be in agreement on the requirement.
>
> I'm willing to consider code to dynamically enable reflink, yes.
>

Well, if we can get a consensus on what should be supported, I can work
on it, and if you prefer to implement it, I will be happy to test.

>>
>> Good, so you are saying that the tool to enable refcount offline is already
>> available and I can basically choose option #2.
>> In that case, no further questions :-)
>
> Keep in mind that editing the filesystem with xfs_db and running
> xfs_repair to fill in the gaps is totally unsupported behavior!
>
> If you break it you get to keep all the pieces.
>
> I'd much, much, much rather have a properly engineered and tested
> upgrade path, which I guess we could do for reflink.
>

I'd much much much much rather that as well.

>> > If you're building your own kernels, you could just tweak
>> > xfs_reflink_remap_range with something like:
>> >
>> > if (!capable(CAP_SYS_ADMIN))
>> >         return -EOPNOTSUPP;
>> >
>> > so that only you (well, root) can make files share blocks.
>> >
>>
>> Sure, I know that :)
>> I am not the admin in this case though, I am the developer
>> who wants to prevent other developers and admins from
>> messing with reflink before it is ripe.
>> And let us not forget:
>> a76b5b0 fs: try to clone files first in vfs_copy_file_range
>> And what would happen when the nfsd on the systems tries to
>> copy a file range?
>
> <shrug> vfs_copy_file_range -> xfs_clone_file_range ->
> xfs_reflink_remap_range....
>

What I meant is that I could probably make sure there are no
obvious programs on our systems that issue a clone ioctl,
but nfsd which runs as root is going to be a source for
copy/clone requests from clients, so the
!capable(CAP_SYS_ADMIN) test is insufficient.
If I have to patch our systems, I will add a -o noreflink mount option.
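
For reference, the kind of userspace call I would be auditing for looks
roughly like the sketch below. It assumes the FICLONE ioctl from
<linux/fs.h> (kernel headers 4.5 or later); the helper name try_clone()
is made up for illustration and is not taken from any existing program.
Any process that can open both files can issue this, and a root-owned
daemon would pass a CAP_SYS_ADMIN check just the same.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE on kernel headers >= 4.5 */

#ifndef FICLONE
/* Fallback for older headers; same value the kernel headers define. */
#define FICLONE _IOW(0x94, 9, int)
#endif

/*
 * Hypothetical helper: ask the filesystem to make dst_fd share all of
 * src_fd's blocks.  Returns 0 on success or a negative errno value,
 * e.g. -EOPNOTSUPP when the filesystem refuses to reflink.
 */
static int try_clone(int src_fd, int dst_fd)
{
	if (ioctl(dst_fd, FICLONE, src_fd) == 0)
		return 0;
	return -errno;
}
```

A caller would typically fall back to a plain read/write copy when
try_clone() returns a negative value.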

>>
>> :-/ preallocating log space and AG space is an issue.
>> I can tweak mkfs.xfs to preallocate those for my use case,
>> but I am hoping that the need meets a bigger crowd and xfsprogs 4.9
>> would have a solution for that.
>
> In general, mkfs seems to create a log that's more than large enough to
> handle a dynamic increase in features.
>

So for large enough arrays I suppose that preallocating log space is not
an issue?

>> How about having mkfs.xfs 4.9 preallocate the space needed for
>> refcountbt if rmapbt=1? It's a bit of a hack, which Dave most probably
>> won't like, but it avoids the need to define a new refcountbt=1 flag
>> just for the preallocation.
>
> Chances are pretty good there's enough space unless your fs is totally
> full, and if it's full then you might seriously consider a full
> backup/restore cycle onto a bigger disk to reduce fragmentation.
>

I thought there was an issue with reserved space per AG and
that the amount of reserved space for btree blocks depends on the
features. If a single full AG is not an issue then never mind.

>>
>> The lesson is that if xfs_repair is able to de-refcount all blocks
>> (given sufficient disk space) and turn off the reflink feature and if
>> that functionality is well tested, then more users would have the
>> courage to enable reflink during its "beta" phase.
>
> Sure, but IIRC you could nuke all the corrupt snapshots by deleting the
> hidden snapshots file and releasing all the space it referenced back to
> the filesystem, which makes it easy to zap all the snapshots if
> something is amiss.
>
> Un-sharing an fs full of reflinked files requires us to build code to
> iterate every bmbt of every file (or to cross-reference every refcountbt
> record against the rmapbt to find the sharers) and then relocate the
> data, which is quite a bit more complex... and unnecessary since we can
> rebuild all the broken refcount metadata anyway.
>

You are right, of course, from a technical POV, but psychologically, if people
know they have a safe way back to what they know and trust, it is easier
for them to make the leap...

Amir.