Re: ceph fs mv does copy, not move

Adding to this: I remember being surprised that, when doing an mv on CephFS between directories linked to different pools, only some meta(?) data was moved/changed while the actual data stayed in the old pool.
I am not sure whether this is still the case in newer Ceph versions, but I would rather see the data being moved completely. That is what everyone expects, even if it takes more time in this case between different pools.
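
For anyone who wants to check this on their own cluster: the pool a file's
data lives in can be read from the CephFS virtual xattrs. Below is a minimal
sketch (Python; the mount point and file names are made up) that compares the
reported data pool of a file before and after an mv between two directories
with different layouts. If the pool does not change, only the metadata moved.

    import os

    # Hypothetical paths: two directories pinned to different data pools
    # via their ceph.dir.layout.pool attribute; adjust to your own mount.
    # (Assumes no quota boundary is crossed, otherwise the rename itself
    # is refused by the client.)
    SRC = "/mnt/cephfs/pool_a_dir/bigfile"
    DST = "/mnt/cephfs/pool_b_dir/bigfile"

    def data_pool(path):
        # ceph.file.layout.pool reports the data pool the file's objects
        # were written to when the file was created.
        return os.getxattr(path, "ceph.file.layout.pool").decode()

    print("pool before move:", data_pool(SRC))
    os.rename(SRC, DST)  # what 'mv' does within one mount
    print("pool after move: ", data_pool(DST))
    # Same pool printed twice: the objects stayed in the old pool and
    # only the directory entry (metadata) was moved.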


> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: Thursday, 24 June 2021 17:34
> To: Patrick Donnelly <pdonnell@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject:  Re: ceph fs mv does copy, not move
> 
> Dear Patrick,
> 
> thanks for letting me know.
> 
> Could you please consider making this a ceph client mount option, for
> example '-o fast_move', that enables a code path enforcing an mv to be
> a proper atomic mv, with the risk that in some corner cases the target
> quota is overrun? With this option enabled, a move should either be a
> move or fail outright with "out of disk quota" (no partial move, no
> cp+rm at all). The failure should only occur if it is absolutely
> obvious that the target quota will be exceeded. Any corner cases are
> the responsibility of the operator. Application crashes due to
> incorrect error handling are acceptable.
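
To illustrate what that would mean from an application's point of view (a
sketch only; '-o fast_move' is Frank's proposed option, not something that
exists today, and the paths are made up): with such an option, a rename
across quota domains would either succeed or fail cleanly, whereas today (if
I understand the current behaviour correctly) the client refuses the rename
with EXDEV and 'mv' quietly falls back to copy+delete.

    import errno
    import os

    SRC = "/mnt/cephfs/projectA/dataset"  # hypothetical source (own quota)
    DST = "/mnt/cephfs/projectB/dataset"  # hypothetical target (own quota)

    try:
        # Desired semantics: a real rename, O(1), no data copied.
        os.rename(SRC, DST)
        print("moved atomically")
    except OSError as e:
        if e.errno == errno.EDQUOT:
            # With the proposed option: target quota obviously too small,
            # fail outright, nothing partially moved.
            print("target quota would be exceeded, nothing was moved")
        elif e.errno == errno.EXDEV:
            # Today's behaviour: the client rejects cross-quota renames,
            # and 'mv' then falls back to copy+delete behind the scenes.
            print("rename refused; 'mv' would now copy and delete")
        else:
            raise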
> 
> Reasoning:
> 
> From a user's/operator's point of view, the preferred behaviour is
> that in cases where a definite quota overrun can reliably be detected
> in advance, the move should actually fail with "out of disk quota"
> instead of resorting to cp+rm, which can lead to partial moves and a
> total mess for users/operators to clean up. In any other case, the
> quota should simply be ignored and the move should be a complete
> atomic move, at the risk that the target quota is exceeded and IO
> stalls. A temporary stall or failure of IO until the operator raises
> the quota again is, in my opinion and use case, highly preferable to
> the alternative of cp+rm. A quota or a crashed job is quick to fix; a
> partial move is not.
> 
> Some background:
> 
> We use ceph fs as an HPC home file system and as a back-end store. Being
> able to move data quickly across the entire file system is essential,
> because users re-factor their directory structure containing huge
> amounts of data quite often for various reasons.
> 
> On our system, we set file system quotas mainly for psychological
> reasons. We run a cron job that adjusts the quotas every day so that
> the mount points show between 20% and 30% free capacity. The
> psychological side here is to give users an incentive to clean up
> temporary data. It is not intended to limit usage seriously, only to
> limit, as a safeguard, what can be done between cron job runs. The
> pool quotas set the real hard limits.
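
(Something along the lines of the sketch below is roughly what such a cron
job can look like. This is illustrative only; the mount point, minimum quota
and 25% head-room are my own assumptions, not Frank's actual script. It reads
the recursive usage from ceph.dir.rbytes and sets ceph.quota.max_bytes so
that about a quarter of each quota shows as free.)

    import os

    MOUNT = "/mnt/cephfs/home"       # hypothetical mount point
    HEADROOM = 0.25                  # aim for ~25% free per directory
    MIN_QUOTA = 10 * 1024**3         # never set a quota below 10 GiB

    for entry in os.scandir(MOUNT):
        if not entry.is_dir(follow_symlinks=False):
            continue
        # Recursive byte count below this directory (CephFS virtual xattr).
        used = int(os.getxattr(entry.path, "ceph.dir.rbytes"))
        # New quota such that current usage fills about 75% of it.
        quota = max(int(used / (1.0 - HEADROOM)), MIN_QUOTA)
        os.setxattr(entry.path, "ceph.quota.max_bytes", str(quota).encode())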
> 
> I'm in the process of migrating 100+ TB right now and am really happy
> that I still have a client where I can do an O(1) move. It would be a
> disaster if I now had to use rsync or similar, which would take weeks.
> 
> Please, in situations like this where developers seem to have to make
> a definite choice, consider letting operators choose the alternative
> that suits their use case best. Adding further options seems far
> better than limiting functionality in a way that becomes a terrible
> burden in some, if not many, use cases.
> 
> In ceph fs there have been many such decisions that allow for
> different answers from a user/operator perspective. For example, I
> would prefer to be able to drop the higher POSIX compliance level that
> ceph fs attempts compared with Lustre, just disable all the
> client-caps and cache-coherence management, and turn it into an
> awesome scale-out parallel file system. The attempt at POSIX-compliant
> handling of simultaneous writes to files offers nothing to us, but
> comes at a huge performance cost and forces users to move away from
> perfectly reasonable HPC workflows. Likewise, having to wait for a TTL
> to expire before changes on one client become visible on another
> (unless direct_io is used for all IO) would be perfectly acceptable
> for us, given the potential performance gain from simpler client-MDS
> communication.
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
> Sent: 24 June 2021 05:29:45
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  ceph fs mv does copy, not move
> 
> Hello Frank,
> 
> On Tue, Jun 22, 2021 at 2:16 AM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Dear all,
> >
> > some time ago I reported that the kernel client resorts to a copy
> > instead of a move when moving a file across quota domains. I was told
> > that the fuse client does not have this problem. If enough space is
> > available, a move should be a move, not a copy.
> >
> > Today, I tried to move a large file across quota domains, testing both
> > the kernel and the fuse client. Both still resort to a copy even though
> > this issue was addressed quite a while ago
> > (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/44AEIHNEGKV4VGCARRTARGFZ264CR4T7/#XY7ZCE3KWHI4QSUNZHDWL3QZQFOHXRQW).
> > The versions I'm using are (CentOS 7):
> >
> > # yum list installed | grep ceph-fuse
> > ceph-fuse.x86_64                      2:13.2.10-0.el7    @ceph
> >
> > # uname -r
> > 3.10.0-1160.31.1.el7.x86_64
> >
> > Any suggestions how to get this to work? I have to move directories
> > containing 100+ TB.
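
For anyone reproducing this: a quick way to tell whether the client did a
true rename or silently fell back to copy+delete is to compare inode numbers
before and after the move (sketch below; the test paths are made up). A real
rename keeps the inode, a copy gets a new one.

    import os
    import shutil

    SRC = "/mnt/cephfs/quota_a/testfile"  # hypothetical file in one quota domain
    DST = "/mnt/cephfs/quota_b/testfile"  # hypothetical target in another

    ino_before = os.stat(SRC).st_ino
    shutil.move(SRC, DST)   # like 'mv': tries rename, else copies and deletes
    ino_after = os.stat(DST).st_ino

    print("true rename" if ino_before == ino_after
          else "fell back to copy+delete")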
> 
> ceph-fuse reverted this behavior in:
> https://tracker.ceph.com/issues/48203
> The kernel had a patch around that time too.
> 
> In summary, it was not possible to accurately account for the quota
> usage prior to doing the rename. Rather than allow a quota to
> potentially be massively overrun, we fell back to the old behavior of
> not allowing it.
> 
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



