Re: mds behind on trimming - replay until memory exhausted

Francois Legrand <fleg@xxxxxxxxxxxxxx> · Tue, 9 Jun 2020 22:20:29 +0200

Hi,
Actually, I left the mds managing the damaged filesystem as it is, because
the files can still be read (despite the warnings and errors). So I
restarted the rsyncs to transfer everything to the new filesystem (which
uses different PGs, because it's a different cephfs with different pools),
but without deleting the old files, to avoid killing the old mds and the
old fs for good. The number of segments is now more or less
stable (very high, ~123611, but not increasing much).
I think we will have enough space to copy the remaining data (it
will be tight, but I think it will fit). Once everything has been
transferred and checked, I will destroy the old FS and the damaged pool.
F.

On 09/06/2020 at 19:50, Frank Schilder wrote:
It looks like an answer to your other thread is taking a while.

Is it a possible option for you to

- copy all readable files using this PG to some other storage,
- remove or clean up the broken PG and
- copy the files back in?

This might lead to a healthy cluster, though I don't know a proper procedure. Somehow the ceph fs must play along, since files using this PG will also use other PGs and will end up partly broken.

Have you found other options?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 08 June 2020 16:38:18
To: Frank Schilder; ceph-users
Subject: Re:  Re: mds behind on trimming - replay until memory exhausted

I have already had some discussion on the list about this problem, but I
should ask again.
We really did lose some objects, and there are not enough shards left to
reconstruct them (it's an erasure-coded data pool)... so it cannot be
fixed anymore and we know we have data loss! I did not mark the PG
out because some parts (objects) are still present
and we hope to be able to copy them and save a few more bytes! It would
be great to be able to flush only the broken objects, but I don't know how
to do that, or even whether it's possible!
So I ran cephfs-data-scan pg_files to identify the files with
data on this pg, and then I ran grep -q -m 1 "." "/path_to_damaged_file"
to identify the ones which are really empty (we tested different ways to
do this and this seems to be the fastest).
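
For anyone who wants to script the same check, here is a minimal Python sketch of the grep test above. It only approximates `grep -q -m 1 "."`, under stated assumptions: the function name `has_any_content` and the chunk size are illustrative, and a read error (for example EIO on a lost object) is treated as "no content". On a damaged cephfs a read may also block instead of failing, which this sketch does not handle.

```python
import os
import tempfile


def has_any_content(path, chunk_size=4096):
    """Return True as soon as one non-newline byte can be read from path.

    Roughly mirrors `grep -q -m 1 "." path`: stop at the first match
    instead of reading the whole file. The name and chunk size are
    illustrative assumptions, not part of any Ceph tool.
    """
    try:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    return False  # reached EOF without finding data
                if chunk.strip(b"\n"):
                    return True  # at least one byte that is not a newline
    except OSError:
        # Assumption: an unreadable file counts as having no usable content.
        return False


if __name__ == "__main__":
    # Self-contained demo on temporary local files, not on cephfs.
    with tempfile.TemporaryDirectory() as tmp:
        full = os.path.join(tmp, "full")
        empty = os.path.join(tmp, "empty")
        with open(full, "wb") as fh:
            fh.write(b"data\n")
        open(empty, "wb").close()
        print(has_any_content(full), has_any_content(empty))
```

In practice the paths would come from the cephfs-data-scan pg_files output, one call per file.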
F.

On 08/06/2020 at 16:07, Frank Schilder wrote:
OK, now we are talking. It is quite possible that trimming will not start until this operation is completed.

If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did lose too many replicas, there are ways to flush this PG out of the system. You will lose data this way. I don't know how to repair or flush only the broken objects out of a PG, but I would hope that this is possible.

Before you do anything destructive, open a new thread in this list specifically for how to repair/remove this PG with the least possible damage.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 08 June 2020 16:00:28
To: Frank Schilder; ceph-users
Subject: Re:  Re: mds behind on trimming - replay until memory exhausted

There is no recovery going on, but we do indeed have a damaged pg (with
some objects lost in a major crash a few weeks ago)... and there are
some shards of this pg on osd 27!
That's also why we are migrating all the data out of this FS!
It's certainly related: I guess it's trying to remove some
data that is already lost and it gets stuck! I don't know if there is
a way to tell ceph to forget about these ops! I guess not.
So I think there is not much to do apart from reading as
much data as we can, to save as much as possible.
F.

On 08/06/2020 at 15:48, Frank Schilder wrote:
That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 08 June 2020 15:27:59
To: Frank Schilder; ceph-users
Subject: Re:  Re: mds behind on trimming - replay until memory exhausted

Thanks again for the hint !
Indeed, I ran
ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests
and it seems that osd 27 is more or less stuck with an op of age 34987.5
(while the other OSDs have ages < 1).
I tried a ceph osd down 27, which reset the age, but I
notice that the age of the osd.27 ops is rising again.
I think I will restart it (btw our osd servers and mds are different
machines).
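
A minimal Python sketch of how the objecter_requests output could be summarised per OSD. It assumes the command prints JSON with an "ops" list whose entries carry "osd" and "age" fields; those field names and the helper name `oldest_op_age_per_osd` are assumptions to check against the real output. The sample data is made up, except for the age of osd 27 quoted above.

```python
import json


def oldest_op_age_per_osd(objecter_json):
    """Map each OSD id to the age (in seconds) of its oldest in-flight op."""
    data = json.loads(objecter_json)
    oldest = {}
    for op in data.get("ops", []):
        osd = int(op["osd"])
        age = float(op["age"])
        oldest[osd] = max(age, oldest.get(osd, 0.0))
    return oldest


# Made-up sample shaped like the assumed output; only the age of
# osd 27 (34987.5) is taken from the message above.
SAMPLE = """
{"ops": [
  {"tid": 1, "osd": 27, "age": 34987.5},
  {"tid": 2, "osd": 3,  "age": 0.4},
  {"tid": 3, "osd": 27, "age": 12.0}
]}
"""

if __name__ == "__main__":
    ages = oldest_op_age_per_osd(SAMPLE)
    # List the OSDs with the oldest pending op first.
    for osd, age in sorted(ages.items(), key=lambda kv: -kv[1]):
        print(f"osd.{osd}: oldest op {age:.1f}s")
```

Sorting by age makes a single stuck OSD stand out immediately from the ones answering in under a second.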
F.

On 08/06/2020 at 15:01, Frank Schilder wrote:
Hi Francois,

this sounds great. At least it's operational. I guess it is still using a lot of swap while trying to replay operations.

I would cleanly disconnect all clients if you haven't done so already, even read-only clients. Any extra load will just slow down recovery. My best guess is that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming.

The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSD's ops queue. I would expect this OSD to have a really long ops queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out.

If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful, if in doubt ask the list first.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 08 June 2020 14:45:13
To: Frank Schilder; ceph-users
Subject: Re:  Re: mds behind on trimming - replay until memory exhausted

Hi Frank,
In the end I did:
ceph config set global mds_beacon_grace 600000
and created /etc/sysctl.d/sysctl-ceph.conf with
vm.min_free_kbytes=4194303
and then
sysctl --system

After that, the mds stayed in rejoin for a very long time (almost 24
hours) with errors like:
2020-06-07 04:10:36.802 7ff866e2e700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-06-07 04:10:36.802 7ff866e2e700  0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271)
and also
2020-06-07 04:10:44.942 7ff86d63b700  0 auth: could not find secret_id=10363
2020-06-07 04:10:44.942 7ff86d63b700  0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363

but in the end the mds went active! :-)
I left it alone from Sunday afternoon until this morning.
I was indeed able to connect clients (read-only for now) and read the
data.
I checked the connected clients with ceph tell
mds.lpnceph-mds02.in2p3.fr client ls
and disconnected the few clients still there (with umount), then checked
with the same command that they were no longer connected.
But I still have the following warnings:
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
         mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
MDS_TRIM 1 MDSs behind on trimming
         mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836
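
To follow the backlog over time, the MDS_TRIM line can be parsed with a few lines of Python. This is a minimal sketch, assuming the health text contains the "Behind on trimming (num/max)" pattern exactly as quoted above; the helper name `trim_backlog` is illustrative. With the values shown, the journal holds roughly 960 times the configured max_segments.

```python
import re

# Matches the "(num_segments/max_segments)" pair in the MDS_TRIM detail line.
TRIM_RE = re.compile(r"Behind on trimming \((\d+)/(\d+)\)")


def trim_backlog(health_text):
    """Return (num_segments, max_segments, ratio), or None if not found."""
    match = TRIM_RE.search(health_text)
    if match is None:
        return None
    num_segments = int(match.group(1))
    max_segments = int(match.group(2))
    return num_segments, max_segments, num_segments / max_segments


# Health excerpt copied from the message above.
HEALTH = (
    "MDS_TRIM 1 MDSs behind on trimming\n"
    "    mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128)\n"
    "max_segments: 128, num_segments: 122836\n"
)

if __name__ == "__main__":
    print(trim_backlog(HEALTH))
```

Running it periodically on the output of ceph health detail would show whether the number of segments is still rising or has started to fall.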

and the number of segments is still rising (slowly).
F.

On 08/06/2020 at 12:00, Frank Schilder wrote:
Hi Francois,

did you manage to get any further with this?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder<frans@xxxxxx>
Sent: 06 June 2020 15:21:59
To: ceph-users;fleg@xxxxxxxxxxxxxx
Subject:  Re: mds behind on trimming - replay until memory exhausted

I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to a laggy connection. This lagginess is a result of swapping:

2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy

Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 06 June 2020 11:11
To: Frank Schilder; ceph-users
Subject: Re:  Re: mds behind on trimming - replay until memory exhausted

Thanks for the tip,
I will try that. For now vm.min_free_kbytes = 90112.
Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0,
but this didn't change anything...
         -27> 2020-06-06 06:15:07.373 7f83e3626700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon
This is the same time since the last acked beacon that I had before changing the
parameter.
As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should
lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
F.

On 06/06/2020 at 09:47, Frank Schilder wrote:
Hi Francois,

there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value for the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with

sysctl vm.min_free_kbytes

In your case, with heavy swap usage, this value should probably be somewhere between 2 and 4 GB.

Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop page cache if necessary or reboot the machine before setting this value.
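
As a rough pre-check before touching vm.min_free_kbytes, one could compare MemFree from /proc/meminfo with the intended value. The Python sketch below is only an illustration of that idea: the helper names and the safety margin of 2.0 are arbitrary assumptions, not a kernel rule, and the sample values are made up.

```python
import re


def meminfo_kb(meminfo_text, field):
    """Return the value of one /proc/meminfo field in kB, or None."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def enough_free_for(meminfo_text, target_kbytes, margin=2.0):
    """True if MemFree is at least margin * target_kbytes.

    The margin of 2.0 is an arbitrary illustrative safety factor.
    """
    free = meminfo_kb(meminfo_text, "MemFree")
    return free is not None and free >= margin * target_kbytes


# Made-up sample in the /proc/meminfo layout.
SAMPLE = "MemTotal:       49152000 kB\nMemFree:        10485760 kB\n"

if __name__ == "__main__":
    # 4194303 kB is about 4 GB.
    print(enough_free_for(SAMPLE, 4194303))
```

On a real Linux host the text would come from reading /proc/meminfo; if the check fails, dropping the page cache or rebooting first, as advised above, is the safer path.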

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder<frans@xxxxxx>
Sent: 06 June 2020 00:36:13
To: ceph-users;fleg@xxxxxxxxxxxxxx
Subject:  Re: mds behind on trimming - replay until memory exhausted

Hi Francois,

yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60s will do. For this particular recovery operation, you might want to go much higher (1h) and watch the cluster health closely.

Good luck and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 23:51:04
To: Frank Schilder; ceph-users
Subject: Re:  mds behind on trimming - replay until memory exhausted

Hi,
Unfortunately adding swap did not solve the problem !
I added 400 GB of swap. It used about 18GB of swap after consuming all
the ram and stopped with the following logs :

2020-06-05 21:33:31.967 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
2020-06-05 21:33:40.355 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
2020-06-05 21:33:59.787 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-06-05 21:33:59.787 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
2020-06-05 21:34:00.287 7f251b7e5700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2020-06-05 21:34:00.287 7f251b7e5700  0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
....
2020-06-05 21:39:05.991 7f251bfe6700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2020-06-05 21:39:06.015 7f251bfe6700  1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
2020-06-05 21:39:06.015 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
2020-06-05 21:39:19.838 7f251bfe6700  1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
2020-06-05 21:39:19.869 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
2020-06-05 21:39:19.870 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr respawn!
--- begin dump of recent events ---
        -9999> 2020-06-05 19:28:07.982 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
        -9998> 2020-06-05 19:28:11.053 7f251b7e5700  5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
        -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
        -9996> 2020-06-05 19:28:12.176 7f25217f1700  5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
        -9995> 2020-06-05 19:28:12.176 7f251e7eb700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
        -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
        -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
...
2020-06-05 21:39:31.092 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
2020-06-05 21:39:35.257 7f3c4d5e3700  1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
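
To see how long the MDS had gone without an acknowledged beacon each time it was declared laggy, the relevant values can be pulled out of such a log with a short Python sketch. The helper name `beacon_lags` is illustrative, and the regular expression assumes the message wording stays exactly as in the excerpt above; it tolerates the line wrapping introduced by mail clients.

```python
import re

# Whitespace is matched loosely so that hard-wrapped log lines still match.
LAG_RE = re.compile(r"laggy;\s+([0-9.]+)s\s+since\s+last\s+acked\s+beacon")


def beacon_lags(log_text):
    """Return every 'since last acked beacon' delay found, in seconds."""
    return [float(value) for value in LAG_RE.findall(log_text)]


# Log line copied from the excerpt above, with its original wrapping.
LOG = (
    "2020-06-05 21:39:06.015 7f251bfe6700  1\n"
    "mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to\n"
    "be laggy; 310.228s since last acked beacon\n"
)

if __name__ == "__main__":
    print(beacon_lags(LOG))
```

Comparing these delays across restarts shows whether a changed grace setting has any visible effect on when the MDS gets removed from the map.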

However, the mons don't seem particularly loaded!
So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did
it for both the mds and mon daemons because it seems to be present in
both confs).
I will tell you if it works.

Any other clue?
F.

On 05/06/2020 at 14:44, Frank Schilder wrote:
Hi Francois,

thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash before the restarting MDS changes the cache.

In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade.

My best bet right now is to try to add swap. Maybe someone else reading this has a better idea or you find a hint in one of the other threads.

Good luck!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 14:34:06
To: Frank Schilder; ceph-users
Subject: Re:  mds behind on trimming - replay until memory exhausted

On 05/06/2020 at 14:18, Frank Schilder wrote:
Hi Francois,

I was also wondering if setting mds dump cache after rejoin could help ?
Haven't heard of that option. Is there some documentation?
I found it on :
https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
mds dump cache after rejoin
Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
Type: Boolean
Default: false

but I don't think it can help in my case, because rejoin occurs after
replay and in my case replay never ends !

I have :
osd_op_queue=wpq
osd_op_queue_cut_off=low
I can try to set osd_op_queue_cut_off to high, but it will only be
useful if the mds gets active, right?
I think so. If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change and after re-starting all OSDs my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading).
For sure I would prefer not to restart all daemons because the second
filesystem is up and running (with production clients).

For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB
which seems reasonable for a mds server with 32/48GB).
This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug, I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
I will have a look.

I already forced the clients to unmount (and even rebooted the ones from
which the rsync and the rmdir .snaps were launched).
I don't know when the MDS acknowledges this. If it was a clean unmount (i.e. without -f or forced by reboot) the MDS should have dropped the clients already. If it was an unclean unmount it might not be that easy to get the stale client session out. However, I don't know about that.
Moreover when I did that, the mds was already not active but in replay,
so for sure the unmount was not acknowledged by any mds !

I think that providing more swap may be the solution! I will try that if
I cannot find another way to fix it.
If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. Will take a while, but it will do eventually.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 13:46:03
To: Frank Schilder; ceph-users
Subject: Re:  mds behind on trimming - replay until memory exhausted

I was also wondering if setting mds dump cache after rejoin could help ?

On 05/06/2020 at 12:49, Frank Schilder wrote:
Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).

How many rsync processes are you running in parallel?
Do you have these settings enabled:

           osd_op_queue=wpq
           osd_op_queue_cut_off=high

WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting the latter removed all the "behind on trimming" problems we had seen before.

You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take with a grain of scepticism):

- reduce the MDS cache memory limit to force recall of caps much earlier than now
- reduce client cache size
- set "osd_op_queue_cut_off=high" if not already done so, I think this requires restart of OSDs, so be careful

At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.

If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.

Harder measures:

- stop all I/O from the FS clients, throw users out if necessary
- ideally, try to cleanly (!) shut down clients or force trimming the cache by either
           * umount or
           * sync; echo 3 > /proc/sys/vm/drop_caches
           Either of these might hang for a long time. Do not interrupt and do not do this on more than one client at a time.

At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.

My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again.
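
The rule of thumb above (cache limit at most half of physical memory) is easy to check numerically. A minimal Python sketch, where the helper name `cache_limit_ok` is illustrative and the server sizes are treated as GiB for simplicity:

```python
GIB = 1024 ** 3


def cache_limit_ok(cache_limit_bytes, ram_bytes):
    """True if mds_cache_memory_limit is at most half of physical RAM."""
    return cache_limit_bytes * 2 <= ram_bytes


if __name__ == "__main__":
    # Values mentioned elsewhere in this thread: a limit of
    # 8589934592 bytes on servers with 32 or 48 GB of memory.
    limit = 8589934592
    for ram_gib in (32, 48):
        print(ram_gib, cache_limit_ok(limit, ram_gib * GIB))
```

The 8 589 934 592 byte limit mentioned elsewhere in this thread already satisfies the rule on both servers, so the rule alone does not explain a memory blow-up during replay.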

Hope that helps.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand<fleg@xxxxxxxxxxxxxx>
Sent: 05 June 2020 11:42:42
To: ceph-users
Subject:  mds behind on trimming - replay until memory exhausted

Hi all,
We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
3 mds (1 active for each fs + one failover).
We are transferring all the data (~600M files) from one FS (which was in
EC 3+2) to the other FS (in R3).
On the old FS we first removed the snapshots (to avoid stray problems
when removing files) and then ran some rsyncs that delete the files after the
transfer.
The operation should take a few more weeks to complete.
But a few days ago, we started to get "mds behind on trimming" warnings
from the mds managing the old FS.
Yesterday, I restarted the active mds service to force a takeover by
the standby mds (basically because the standby is more powerful and
has more memory, i.e. 48GB versus 32).
The standby mds took rank 0 and started to replay... the "mds behind
on trimming" warning came back and the number of segments rose, as did the
memory usage of the server. Finally, it exhausted the memory of the mds
and the service stopped, and the previous mds took rank 0 and started to
replay... until memory exhaustion and a new switch of mds, etc.
So it seems that we are in a never-ending loop! And of course, as the
mds is always in replay, the data is not accessible and the transfers
are blocked.
I stopped all the rsyncs and unmounted the clients.

My questions are :
- Does the mds trim during the replay so we could hope that after a
while it will purge everything and the mds will be able to become active
at the end ?
- Is there a way to accelerate the operation or to fix this situation ?

Thanks for your help.
F.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx