Re: ceph fs mv does copy, not move

Dear Patrick,

thanks for letting me know.

Could you please consider making this a ceph client mount option, for example '-o fast_move', that enables a code path which forces an mv to be a proper atomic mv, with the risk that in some corner cases the target quota is overrun? With this option enabled, a move should either be a move or fail outright with "out of disk quota" (no partial move, no cp+rm at all). The failure should occur only if it is absolutely obvious that the target quota will be exceeded. Any corner cases are the responsibility of the operator. Application crashes due to incorrect error handling are acceptable.
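
For illustration, mounting with such an option might look like this (the 'fast_move' option is hypothetical and does not exist today; the monitor address and credentials are placeholders):

# mount -t ceph mon1:6789:/ /mnt/cephfs -o name=hpc,secretfile=/etc/ceph/hpc.secret,fast_move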

Reasoning:

From a user's/operator's side, the preferred functionality is that in cases where a definite quota overrun can reliably be detected in advance, the move should actually fail with "out of disk quota" instead of resorting to cp+rm, which potentially leads to partial moves and a total mess for users/operators to clean up. In any other case, the quota should simply be ignored and the move should be a complete atomic move, with the risk of exceeding the target quota and stalling IO. A temporary stall or failure of IO until the operator increases the quota again is, in my opinion and use case, highly preferable to the alternative of cp+rm. A quota overrun or a crashed job is fast to fix, a partial move is not.
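
Expressed as a sketch using the real CephFS xattrs (the decision logic itself is the hypothetical part I am asking for, not actual client code):

# src_size  = getfattr -n ceph.dir.rbytes <src dir>           (bytes to move)
# dst_used  = getfattr -n ceph.dir.rbytes <dst quota root>
# dst_quota = getfattr -n ceph.quota.max_bytes <dst quota root>
# fail with EDQUOT only if dst_used + src_size obviously exceeds dst_quota;
# otherwise perform the atomic rename and accept a possible temporary overrun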

Some background:

We use ceph fs as an HPC home file system and as a back-end store. Being able to move data quickly across the entire file system is essential, because users quite often restructure directories containing huge amounts of data for various reasons.

On our system, we set file system quotas mainly for psychological reasons. We run a cron job that adjusts the quotas every day to show between 20% and 30% free capacity on the mount points. The psychological side here is to give users an incentive to clean up temporary data. The quotas are not intended to seriously limit usage, only to limit what can be done between cron job runs as a safeguard. The pool quotas set the real hard limits.
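
As an illustration, a minimal sketch of such a cron job, using the real ceph.dir.rbytes and ceph.quota.max_bytes xattrs (the path and the 25% headroom are placeholders, not our exact values):

#!/bin/bash
# For each home directory, set the quota to current usage plus ~25% headroom.
for dir in /mnt/cephfs/home/*; do
    used=$(getfattr --only-values -n ceph.dir.rbytes "$dir")
    setfattr -n ceph.quota.max_bytes -v $((used * 5 / 4)) "$dir"
done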

I'm in the process of migrating 100+TB right now and am really happy that I still have a client where I can do an O(1) move. It would be a disaster if I now had to use rsync or similar, which would take weeks.

Please, in situations like this where developers seem to have to make a definite choice, consider offering operators the possibility to choose the alternative that suits their use case best. Adding further options seems far better than limiting functionality in a way that becomes a terrible burden in some, if not many, use cases.

In ceph fs there have been many such decisions that allow for different answers from a user/operator perspective. For example, I would prefer to be able to get rid of the higher POSIX compliance level that ceph fs attempts compared with Lustre, to just disable all the client-caps and cache-coherence management and turn it into an awesome scale-out parallel file system. The attempt at POSIX-compliant handling of simultaneous writes to files offers nothing to us, but comes at a huge performance cost and forces users to move away from perfectly reasonable HPC workflows. Also, that it takes a TTL to expire before changes on one client become visible on another (unless direct_io is used for all IO) is perfectly acceptable for us, given the potential performance gain from simpler client-MDS communication.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: 24 June 2021 05:29:45
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: ceph fs mv does copy, not move

Hello Frank,

On Tue, Jun 22, 2021 at 2:16 AM Frank Schilder <frans@xxxxxx> wrote:
>
> Dear all,
>
> some time ago I reported that the kernel client resorts to a copy instead of a move when moving a file across quota domains. I was told that the fuse client does not have this problem. If enough space is available, a move should be a move, not a copy.
>
> Today, I tried to move a large file across quota domains, testing both the kernel and the fuse client. Both still resort to a copy, even though this issue was addressed quite a while ago (https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/44AEIHNEGKV4VGCARRTARGFZ264CR4T7/#XY7ZCE3KWHI4QSUNZHDWL3QZQFOHXRQW). The versions I'm using are (CentOS 7):
>
> # yum list installed | grep ceph-fuse
> ceph-fuse.x86_64                      2:13.2.10-0.el7               @ceph
>
> # uname -r
> 3.10.0-1160.31.1.el7.x86_64
>
> Any suggestions on how to get this to work? I have to move directories containing 100+ TB.

ceph-fuse reverted this behavior in: https://tracker.ceph.com/issues/48203
The kernel had a patch around that time too.

In summary, it was not possible to accurately account for the quota
usage prior to doing the rename. Rather than allow a quota to
potentially be massively overrun, we fell back to the old behavior of
not allowing it.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



