Re: [PATCH v1 0/8] VFS: In-kernel copy system call

On 2015-09-10 11:10, Anna Schumaker wrote:
On 09/09/2015 05:16 PM, Darrick J. Wong wrote:
On Wed, Sep 09, 2015 at 02:52:08PM -0400, Anna Schumaker wrote:
On 09/08/2015 06:39 PM, Darrick J. Wong wrote:
On Tue, Sep 08, 2015 at 02:45:39PM -0700, Andy Lutomirski wrote:
On Tue, Sep 8, 2015 at 2:29 PM, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
On Tue, Sep 08, 2015 at 09:03:09PM +0100, Pádraig Brady wrote:
On 08/09/15 20:10, Andy Lutomirski wrote:
On Tue, Sep 8, 2015 at 11:23 AM, Anna Schumaker
<Anna.Schumaker@xxxxxxxxxx> wrote:
On 09/08/2015 11:21 AM, Pádraig Brady wrote:
I see copy_file_range() is a reflink() on BTRFS?
That's a bit surprising, as it avoids the copy completely.
cp(1) for example considered doing a BTRFS clone by default,
but didn't due to expectations that users actually wanted
the data duplicated on disk for resilience reasons,
and for performance reasons so that write latencies were
restricted to the copy operation, rather than being
introduced at usage time as the dest file is CoW'd.

If reflink() is a possibility for copy_file_range()
then could it be done optionally with a flag?

The idea is that filesystems get to choose how to handle copies in the
default case.  BTRFS could do a reflink, but NFS could do a server side

Eww, different default behaviors depending on the filesystem. :)

copy instead.  I can change the default behavior to only do a data copy
(unless the reflink flag is specified) instead, if that is desirable.

What does everybody think?

I think the best you could do is to have a hint asking politely for
the data to be deep-copied.  After all, some filesystems reserve the
right to transparently deduplicate.

Also, on a true COW filesystem (e.g. btrfs sometimes), there may be no
advantage to deep copying unless you actually want two copies for
locality reasons.

Agreed. The reflink and server side copy are separate things.
There's no advantage to not doing a server side copy,
but as mentioned there may be advantages to doing deep copies on BTRFS
(another reason, not previously mentioned in this thread, would be
to avoid ENOSPC errors at some time in the future).

So having control over the deep copy seems useful.
It's debatable whether ALLOW_REFLINK should be on/off by default
for copy_file_range().  I'd be inclined to have such a setting off by default,
but cp(1) at least will work with whatever is chosen.

So far it looks like people are interested in at least these "make data appear
in this other place" filesystem operations:

1. reflink
2. reflink, but only if the contents are the same (dedupe)

What I meant by this was: if you ask for "regular copy", you may end
up with a reflink anyway.  Anyway, how can you reflink a range and
have the contents *not* be the same?

reflink forcibly remaps fd_dest's range to fd_src's range.  If they didn't
match before, they will afterwards.

dedupe remaps fd_dest's range to fd_src's range only if they match, of course.

Perhaps I should have said "...if the contents are the same before the call"?


3. regular copy
4. regular copy, but make the hardware do it for us
5. regular copy, but require a second copy on the media (no-dedupe)

If this comes from me, I have no desire to ever use this as a flag.

I meant (5) as a "disable auto-dedupe for this operation" flag, not as
a "reallocate all the shared blocks now" op...

If someone wants to use chattr or some new operation to say "make this
range of this file belong just to me for purpose of optimizing future
writes", then sure, go for it, with the understanding that there are
plenty of filesystems for which that doesn't even make sense.

"Unshare these blocks" sounds more like something fallocate could do.

So far in my XFS reflink playground, it seems that using the defrag tool to
un-cow a file makes most sense.  AFAICT the XFS and ext4 defraggers copy a
fragmented file's data to a second file and use a 'swap extents' operation,
after which the donor file is unlinked.

Hey, if this syscall turns into a more generic "do something involving two
(fd:off:len) (fd:off:len) tuples" call, I guess we could throw in "swap
extents" as a 7th operation, to refactor the ioctls.  <smirk>


6. regular copy, but don't CoW (eatmyothercopies) (joke)

(Please add whatever ops I missed.)
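
To make the difference between (1) reflink and (2) dedupe concrete, here is a toy in-memory model. It is only a sketch: `struct toy_file`, `toy_reflink()` and `toy_dedupe()` are made-up names, a "file" is just an array of pointers to blocks, and nothing here talks to a real filesystem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4   /* toy block size, chosen only for illustration */
#define MAX_BLOCKS 8

/* A toy "file": each slot points at a block that may be shared. */
struct toy_file {
    const char *blocks[MAX_BLOCKS];
};

/*
 * reflink: unconditionally remap dst's range onto src's blocks.
 * Whatever dst held before, it matches src afterwards.
 * The caller must keep first + count within MAX_BLOCKS.
 */
static void toy_reflink(struct toy_file *dst, const struct toy_file *src,
                        size_t first, size_t count)
{
    for (size_t i = first; i < first + count; i++)
        dst->blocks[i] = src->blocks[i];
}

/*
 * dedupe: remap only if the contents already match before the call.
 * Returns false and leaves dst untouched on any mismatch.
 */
static bool toy_dedupe(struct toy_file *dst, const struct toy_file *src,
                       size_t first, size_t count)
{
    for (size_t i = first; i < first + count; i++)
        if (memcmp(dst->blocks[i], src->blocks[i], BLOCK_SIZE) != 0)
            return false;

    toy_reflink(dst, src, first, count);
    return true;
}
```

In this model a successful dedupe never changes what a reader of dst sees, while a reflink may.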

I think I can see a case for letting (4) fall back to (3) since (4) is an
optimization of (3).

However, I particularly don't like the idea of (1) falling back to (3-5).
Either the kernel can satisfy a request or it can't, but let's not just
assume that we should transmogrify one type of request into another.  Userspace
should decide if a reflink failure should turn into one of the copy variants,
depending on whether the user wants to spread allocation costs over rewrites or
pay it all up front.  Also, if we allow reflink to fall back to copy, how do
programs find out what actually took place?  Or do we simply not allow them to
find out?

Also, programs that expect reflink either to finish or fail quickly might be
surprised if it's possible for reflink to take a longer time than usual and
with the side effect that a deep(er) copy was made.
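
A rough sketch of what "userspace decides" could look like: the caller tries the reflink first, falls back to a copy only if it asked for that, and is told which of the two actually happened. Everything here is hypothetical (`copy_with_policy()`, the `cfr_op` callback type, the result enum, the stubs); the callbacks stand in for whatever clone and copy requests a real program would issue.

```c
#include <errno.h>
#include <stdbool.h>

/* Tells the caller what actually took place. */
enum cfr_result { CFR_DID_REFLINK, CFR_DID_COPY, CFR_FAILED };

/* Hypothetical backend: returns 0 on success, negative errno on failure. */
typedef int (*cfr_op)(int fd_in, int fd_out);

static enum cfr_result copy_with_policy(int fd_in, int fd_out,
                                        cfr_op try_reflink, cfr_op do_copy,
                                        bool allow_fallback)
{
    /* Reflink either succeeds quickly or fails quickly. */
    if (try_reflink(fd_in, fd_out) == 0)
        return CFR_DID_REFLINK;

    /* No silent transmogrification: fall back only if the caller opted in. */
    if (!allow_fallback)
        return CFR_FAILED;

    return do_copy(fd_in, fd_out) == 0 ? CFR_DID_COPY : CFR_FAILED;
}

/* Stubs for illustration: a filesystem without reflink support. */
static int stub_reflink_unsupported(int fd_in, int fd_out)
{
    (void)fd_in; (void)fd_out;
    return -EOPNOTSUPP;
}

static int stub_reflink_ok(int fd_in, int fd_out)
{
    (void)fd_in; (void)fd_out;
    return 0;
}

static int stub_copy_ok(int fd_in, int fd_out)
{
    (void)fd_in; (void)fd_out;
    return 0;
}
```

With the stubs above, a caller that forbids fallback gets CFR_FAILED on the reflink-less filesystem instead of an unexpectedly slow deep copy.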

I guess if someone asks for both (1) and (3) we can do the fallback in the
kernel, like how we handle it right now.


I think we should focus on what the actual legit use cases might be.
Certainly we want to support a mode that's "reflink or fail".  We
could have these flags:

COPY_FILE_RANGE_ALLOW_REFLINK
COPY_FILE_RANGE_ALLOW_COPY

Setting neither gets -EINVAL.  Setting both works as is.  Setting just
ALLOW_REFLINK will fail if a reflink can't be supported.  Setting just
ALLOW_COPY will make a best-effort attempt not to reflink but
expressly permits reflinking in cases where either (a) plain old
write(2) might also result in a reflink or (b) there is no advantage
to not reflinking.
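
For illustration, the ALLOW_* semantics described above could be decoded roughly as follows. This is only a sketch: the numeric flag values, `enum cfr_mode` and `cfr_decode_flags()` are assumptions made for the example, and rejecting unknown bits is an extra assumption not stated in the proposal.

```c
#include <errno.h>

/* Hypothetical values: the proposal names the flags but assigns no numbers. */
#define COPY_FILE_RANGE_ALLOW_REFLINK 0x1u
#define COPY_FILE_RANGE_ALLOW_COPY    0x2u

enum cfr_mode {
    CFR_MODE_REFLINK_OR_FAIL, /* only ALLOW_REFLINK: fail if reflink is unsupported */
    CFR_MODE_PREFER_COPY,     /* only ALLOW_COPY: best effort not to reflink */
    CFR_MODE_EITHER           /* both: the filesystem chooses */
};

/* Returns 0 and stores the mode, or -EINVAL for no flags or unknown flags. */
static int cfr_decode_flags(unsigned int flags, enum cfr_mode *mode)
{
    const unsigned int known = COPY_FILE_RANGE_ALLOW_REFLINK |
                               COPY_FILE_RANGE_ALLOW_COPY;

    /* "Setting neither gets -EINVAL"; unknown bits are rejected by assumption. */
    if (flags == 0 || (flags & ~known))
        return -EINVAL;

    if (flags == known)
        *mode = CFR_MODE_EITHER;
    else if (flags & COPY_FILE_RANGE_ALLOW_REFLINK)
        *mode = CFR_MODE_REFLINK_OR_FAIL;
    else
        *mode = CFR_MODE_PREFER_COPY;

    return 0;
}
```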

I don't agree with having a 'copy' flag that can reflink when we also have a
'reflink' flag.  I guess I just don't like having a flag with different
meanings depending on context.

Users should be able to get the default behavior by passing '0' for flags, so
provide FORBID_REFLINK and FORBID_COPY flags to turn off those behaviors, with
an admonishment that one should only use them if they have a goooood reason.
Passing neither gets you reflink-xor-copy, which is what I think we both want
in the general case.

I agree here that 0 for flags should do something useful, and I wanted to
double check if reflink-xor-copy is a good default behavior.

Ok.


FORBID_REFLINK = 1
FORBID_COPY = 2

I don't like the idea of using flags to forbid behavior.  I think it would be
more straightforward to have flags like REFLINK_ONLY or COPY_ONLY so users
can tell us what they want, instead of what they don't want.

Seems fine to me.

While I'm thinking about flags, COPY_FILE_RANGE_REFLINK_ONLY would be a bit
of a mouthful.  Does anybody have suggestions for ways that I could make this
shorter?

CFR_REFLINK_ONLY?

That could work!  Although I might do as Austin suggests and drop the _ONLY part, and then make the man page clear about what's going on.

Would you expect to trigger a NFS server side copy by passing the pagecache copy flag?  Or would that only happen if I pass flags=0?

Personally, I would think that an NFS server side copy could be counted under the 'hardware assisted' flag. From the point of view of an NFS client, the NFS server is a (usually) opaque piece of storage hardware, similar to a local disk drive in that you pass commands to it and get responses; the only real difference is that NFS is a much higher level protocol than, for example, SCSI.

--D


Thanks,
Anna

CHECK_SAME = 4
HW_COPY = 8

DEDUPE = (FORBID_COPY | CHECK_SAME)
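
Spelled out in C, the FORBID_* numbering above might be interpreted like this. The flag values are the ones given in the proposal; `struct cfr_policy` and `cfr_policy_from_flags()` are hypothetical names used only for this sketch, and it does not try to decide what contradictory combinations should mean.

```c
#include <stdbool.h>

/* Values as given in the proposal above. */
#define FORBID_REFLINK 1u
#define FORBID_COPY    2u
#define CHECK_SAME     4u
#define HW_COPY        8u
#define DEDUPE         (FORBID_COPY | CHECK_SAME)

/* Hypothetical decoded form of the flags word. */
struct cfr_policy {
    bool may_reflink;   /* reflink permitted */
    bool may_copy;      /* data copy permitted */
    bool must_match;    /* contents must already be identical */
    bool want_hw_copy;  /* ask for a hardware-assisted copy */
};

static struct cfr_policy cfr_policy_from_flags(unsigned int flags)
{
    struct cfr_policy p = {
        .may_reflink  = !(flags & FORBID_REFLINK),
        .may_copy     = !(flags & FORBID_COPY),
        .must_match   = (flags & CHECK_SAME) != 0,
        .want_hw_copy = (flags & HW_COPY) != 0,
    };
    return p;
}
```

Passing 0 leaves both reflink and copy permitted (the reflink-xor-copy default), and DEDUPE decodes to "reflink only, and only if the contents already match".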

What do you say to that?

An example of (b) would be a filesystem backed by deduped
thinly-provisioned storage that can't do anything about ENOSPC because
it doesn't control it in the first place.

Another option would be to split up the copy case into "I expect to
overwrite a lot of the target file soon, so (c) try to commit space
for that or (d) try to make it time-efficient".  Of course, (d) is
irrelevant on filesystems with no random access (nvdimms, for
example).

I guess the tl;dr is that I'm highly skeptical of any use for
disallowing reflinking other than forcibly committing space in cases
where committing space actually means something.

That's more or less where I was going too. :)

--D

