Re: mds behind on trimming - replay until memory exhausted

Frank Schilder <frans@xxxxxx> · Fri, 5 Jun 2020 12:44:16 +0000

Hi Francois,

thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash before the restarting MDS changes the cache.

In your case, I would set osd_op_queue_cut_off=high during the next regular cluster maintenance or upgrade.
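
Before that maintenance window it may help to check which of the two queue settings are still missing from a ceph.conf. Below is a minimal Python sketch, not a tested tool. The function name and the sample text are made up for illustration. It only reads the text of a config file and knows nothing about values stored in the monitors' config database or about what the running daemons actually use.

```python
import configparser

# The two settings discussed in this thread.
WANTED = {"osd_op_queue": "wpq", "osd_op_queue_cut_off": "high"}


def missing_global_settings(conf_text):
    """Return the wanted [global] options that are absent or set differently."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    found = {}
    if parser.has_section("global"):
        for key, value in parser.items("global"):
            # ceph.conf accepts "osd op queue" as well as "osd_op_queue"
            found[key.strip().replace(" ", "_")] = value.strip()
    return {k: v for k, v in WANTED.items() if found.get(k) != v}


# Sample text only; on a real host you would pass the contents of your ceph.conf.
sample = """
[global]
osd op queue = wpq
osd_op_queue_cut_off = low
"""
print(missing_global_settings(sample))
```

With the sample above, the result contains only osd_op_queue_cut_off, which matches the values Francois reported.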

My best bet right now is to try to add swap. Maybe someone else reading this has a better idea or you find a hint in one of the other threads.
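
To see how much headroom the MDS host has before and after adding swap, one could read /proc/meminfo. This is a rough Python sketch. The function name is hypothetical and the sample numbers (48 GiB RAM, 64 GiB swap) are assumptions for illustration, not a sizing recommendation.

```python
def meminfo_gib(text, fields=("MemTotal", "SwapTotal", "SwapFree")):
    """Parse /proc/meminfo-style text and return the selected fields in GiB."""
    out = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            kib = int(rest.split()[0])  # the kernel reports these values in kB
            out[name] = kib / 1024 ** 2
    return out


# Made-up sample; on a real host use: meminfo_gib(open("/proc/meminfo").read())
sample = (
    "MemTotal:       50331648 kB\n"
    "SwapTotal:      67108864 kB\n"
    "SwapFree:       67108864 kB\n"
)
print(meminfo_gib(sample))
```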

Good luck!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 14:34:06
To: Frank Schilder; ceph-users
Subject: Re:  mds behind on trimming - replay until memory exhausted

On 05/06/2020 at 14:18, Frank Schilder wrote:
> Hi Francois,
>
>> I was also wondering if setting mds dump cache after rejoin could help ?
> Haven't heard of that option. Is there some documentation?
I found it on :
https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
mds dump cache after rejoin
    Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
    Type: Boolean
    Default: false

but I don't think it can help in my case, because rejoin occurs after
replay and in my case replay never ends !

>> I have :
>> osd_op_queue=wpq
>> osd_op_queue_cut_off=low
>> I can try to set osd_op_queue_cut_off to high, but it will be useful
>> only if the mds gets active, right?
> I think so. If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change and after re-starting all OSDs my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading).

For sure I would prefer not to restart all daemons because the second
filesystem is up and running (with production clients).

>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB
>> which seems reasonable for a mds server with 32/48GB).
> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug, I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
I will have a look.

>> I already forced the clients to unmount (and even rebooted the ones from
>> which the rsync and the rmdir .snaps were launched).
> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e. without -f or forced by reboot) the MDS should have dropped the clients already. If it was an unclean unmount it might not be that easy to get the stale client session out. However, I don't know about that.

Moreover when I did that, the mds was already not active but in replay,
so for sure the unmount was not acknowledged by any mds !

>> I think that providing more swap may be the solution! I will try that if
>> I cannot find another way to fix it.
> If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. It will take a while, but it will get there eventually.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <fleg@xxxxxxxxxxxxxx>
> Sent: 05 June 2020 13:46:03
> To: Frank Schilder; ceph-users
> Subject: Re:  mds behind on trimming - replay until memory exhausted
>
> I was also wondering if setting mds dump cache after rejoin could help ?
>
>
> On 05/06/2020 at 12:49, Frank Schilder wrote:
>> Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).
>>
>> How many rsync processes are you running in parallel?
>> Do you have these settings enabled:
>>
>>     osd_op_queue=wpq
>>     osd_op_queue_cut_off=high
>>
>> WPQ should be default, but osd_op_queue_cut_off=high might not be. Setting the latter removed all the "behind on trimming" problems we had seen before.
>>
>> You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism):
>>
>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>> - reduce client cache size
>> - set "osd_op_queue_cut_off=high" if not already done. I think this requires a restart of the OSDs, so be careful
>>
>> At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.
>>
>> If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.
>>
>> Harder measures:
>>
>> - stop all I/O from the FS clients, throw users out if necessary
>> - ideally, try to cleanly (!) shut down clients or force trimming the cache by either
>>     * umount or
>>     * sync; echo 3 > /proc/sys/vm/drop_caches
>>     Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time.
>>
>> At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.
>>
>> My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again.
>>
>> Hope that helps.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <fleg@xxxxxxxxxxxxxx>
>> Sent: 05 June 2020 11:42:42
>> To: ceph-users
>> Subject:  mds behind on trimming - replay until memory exhausted
>>
>> Hi all,
>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
>> 3 mds (1 active for each fs + one failover).
>> We are transferring all the data (~600M files) from one FS (which was in
>> EC 3+2) to the other FS (in R3).
>> On the old FS we first removed the snapshots (to avoid problems with strays
>> when removing files) and then ran some rsync jobs that delete the files after the
>> transfer.
>> The operation should take a few more weeks to complete.
>> But a few days ago, we started to get "mds behind on trimming" warnings
>> from the mds managing the old FS.
>> Yesterday, I restarted the active mds service to force the takeover by
>> the standby mds (basically because the standby is more powerful and
>> has more memory, i.e. 48GB versus 32GB).
>> The standby mds took rank 0 and started to replay... the "mds behind
>> on trimming" warning came back and the number of segments rose, as did the
>> memory usage of the server. Finally, it exhausted the memory of the mds,
>> the service stopped, and the previous mds took rank 0 and started to
>> replay... until memory exhaustion and a new switch of mds, etc.
>> It thus seems that we are in a never-ending loop! And of course, as the
>> mds is always in replay, the data are not accessible and the transfers
>> are blocked.
>> I stopped all the rsync jobs and unmounted the clients.
>>
>> My questions are:
>> - Does the mds trim during the replay, so that we could hope that after a
>> while it will purge everything and the mds will be able to become active
>> at the end?
>> - Is there a way to accelerate the operation or to fix this situation?
>>
>> Thanks for your help.
>> F.
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx