Hi all,
We have a Ceph Nautilus cluster (14.2.8) with two CephFS filesystems
and 3 MDS daemons (1 active for each FS + one standby).
We are transferring all the data (~600M files) from one FS (which was
in EC 3+2) to the other FS (in R3).
On the old FS we first removed the snapshots (to avoid stray problems
when removing files) and then ran some rsync jobs, deleting the files
after the transfer (along the lines of the sketch below).
The operation should take a few more weeks to complete.
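For reference, a minimal sketch of the kind of rsync invocation in
use (the paths are placeholders; --remove-source-files deletes each
file from the old FS once it has been copied):

    rsync -aHAX --remove-source-files /mnt/oldfs/some/dir/ /mnt/newfs/some/dir/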
But a few days ago, we started to get "MDS behind on trimming"
warnings from the MDS managing the old FS.
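We check it with something like this (mds.oldfs is a placeholder for
the daemon name; the "seg" counter in the mds_log section gives the
current number of journal segments):

    ceph health detail
    ceph daemon mds.oldfs perf dump mds_log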
Yesterday, I restarted the active MDS service to force a takeover by
the standby MDS (basically because the standby is more powerful and
has more memory, i.e. 48 GB instead of 32).
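Concretely, I did something like this on the node holding the active
rank (the daemon name is a placeholder; "ceph mds fail" on the rank
would have been an alternative way to hand over to the standby):

    systemctl restart ceph-mds@mds01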
The standby MDS took rank 0 and started to replay... the "behind on
trimming" warning came back and the number of segments rose, as did
the memory usage of the server. Finally, it exhausted the memory of
the MDS node, the service stopped, and the previous MDS took rank 0
and started to replay... until memory exhaustion and a new switch of
MDS, etc.
It thus seems that we are in a never-ending loop! And of course, as
the MDS is always in replay, the data are not accessible and the
transfers are blocked.
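One way to see whether the replay is at least making progress (I am
assuming the mds_log rdpos/wrpos counters here; the daemon name is a
placeholder) is to watch the journal read position advance towards
the write position:

    watch 'ceph daemon mds.mds01 perf dump mds_log | grep -E "rdpos|expos|wrpos"'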
I stopped all the rsync jobs and unmounted the clients.
My questions are:
- Does the MDS trim during the replay, so we can hope that after a
while it will purge everything and the MDS will be able to become
active in the end?
- Is there a way to speed up the operation or to fix this situation?
Thanks for your help.
F.