Re: mds behind on trimming - replay until memory exhausted




I was also wondering if setting mds_dump_cache_after_rejoin could help?
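
For reference, a sketch of how one could toggle it on Nautilus via the mon config database (as far as I can tell, this option only dumps the MDS cache to a file after the rejoin phase for debugging and does not trim or free memory; <name> is a placeholder for the daemon name):

   # enable the dump for all MDS daemons
   ceph config set mds mds_dump_cache_after_rejoin true

   # check the value a running daemon actually uses (on the MDS host)
   ceph daemon mds.<name> config get mds_dump_cache_after_rejoin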


On 05/06/2020 at 12:49, Frank Schilder wrote:
Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).

How many rsync processes are you running in parallel?
Do you have these settings enabled:

   osd_op_queue=wpq
   osd_op_queue_cut_off=high

WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting the latter resolved all "behind on trimming" problems we had seen before.
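
For reference, a sketch of how you could check and change this via the mon config database on Nautilus (osd.0 is just an example daemon; as said, the OSDs probably need a restart to pick up the new cut-off):

   # show what a running OSD actually uses
   ceph config show osd.0 | grep osd_op_queue

   # set the cut-off for all OSDs in the config database
   ceph config set osd osd_op_queue_cut_off high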

You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take this with a grain of salt):

- reduce the MDS cache memory limit to force recall of caps much earlier than now
- reduce the client cache size (a sketch of both cache settings follows this list)
- set "osd_op_queue_cut_off=high" if you have not already done so; I think this requires a restart of the OSDs, so be careful

At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.

If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.
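
In case it is useful, a minimal sketch of adding a temporary swap file on an SSD-backed filesystem (size and path are only examples; remove it again once the MDS is stable):

   # create and enable a 32 GB swap file
   fallocate -l 32G /var/swapfile   # or: dd if=/dev/zero of=/var/swapfile bs=1M count=32768
   chmod 600 /var/swapfile
   mkswap /var/swapfile
   swapon /var/swapfile

   # later, once the MDS is healthy again
   swapoff /var/swapfile && rm /var/swapfile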

Harder measures:

- stop all I/O from the FS clients, throw users out if necessary (a sketch for listing the remaining client sessions follows this list)
- ideally, try to cleanly (!) shut down clients or force trimming the cache by either
   * umount or
   * sync; echo 3 > /proc/sys/vm/drop_caches
   Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time.
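
To see which clients still hold sessions and caps before you shut them down, something like this should work (a sketch; run the first command on the host of the active MDS and replace <name> with your daemon's name):

   # list client sessions known to this MDS, including the caps they hold
   ceph daemon mds.<name> session ls

   # quick overview of ranks, states and client counts per filesystem
   ceph fs status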

At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.

My experience is that MDSes overrun their cache limits quite a lot. After I reduced mds_cache_memory_limit to no more than half of the physically available memory, I have not had any problems again.

Hope that helps.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 11:42:42
To: ceph-users
Subject:  mds behind on trimming - replay until memory exhausted

Hi all,
We have a Ceph Nautilus cluster (14.2.8) with two CephFS filesystems and
3 MDS daemons (1 active for each FS + one standby).
We are transferring all the data (~600M files) from one FS (which was on
EC 3+2) to the other FS (on replica 3).
On the old FS we first removed the snapshots (to avoid stray problems
when removing files) and then ran some rsyncs, deleting the files after
the transfer.
The operation should take a few more weeks to complete.
But a few days ago, we started to get the warning "mds behind on
trimming" from the MDS managing the old FS.
Yesterday, I restarted the active MDS service to force a takeover by the
standby MDS (basically because the standby is more powerful and has more
memory, i.e. 48 GB versus 32 GB).
The standby MDS took rank 0 and started to replay... the "mds behind on
trimming" warning came back and the number of segments rose, as did the
memory usage of the server. Finally, it exhausted the memory of the MDS,
the service stopped, the previous MDS took rank 0 and started to
replay... until memory exhaustion and another MDS switch, and so on.
It thus seems that we are in a never-ending loop! And of course, as the
MDS is always in replay, the data are not accessible and the transfers
are blocked.
I stopped all the rsyncs and unmounted the clients.

My questions are:
- Does the MDS trim during replay, so that we can hope that after a
while it will purge everything and the MDS will be able to become active
in the end?
- Is there a way to accelerate the operation or to fix this situation?

Thanks for your help.
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx