I was able to start another MDS daemon on another node that had 512GB of
RAM; the active MDS eventually migrated there, went through the replay
(which consumed about 100 GB of RAM), and then things recovered. Phew.
I guess I need significantly more RAM in my MDS servers... I had no idea
the MDS daemon could require that much RAM.
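Roughly, the steps were something like the following (a sketch only: this
assumes a cephadm-managed cluster, and pr-md-04 stands in for the
hypothetical big-RAM host):

# ceph orch apply mds slugfs --placement="pr-md-01 pr-md-02 pr-md-04"
  (adds the big-RAM host to the MDS placement so an extra daemon starts there)
# ceph mds fail slugfs.pr-md-01.xdtppo
  (if it doesn't fail over on its own, this fails the struggling active MDS
  so the standby on the big host takes over)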
-erich
On 4/22/24 11:41 AM, Erich Weiler wrote:
Possibly, but it would be pretty time-consuming and difficult...
Is it maybe a RAM issue, since my MDS server's RAM is filling up? Should I
maybe bring up another MDS on a server with a huge amount of RAM and move
the MDS there, in the hope that it will have enough RAM to complete the replay?
On 4/22/24 11:37 AM, Sake Ceph wrote:
Just a question: is it possible to block or disable all clients? Just
to prevent load on the system.
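A minimal sketch of one way to do that, assuming Reef's refuse_client_session
filesystem flag is available (client IDs come from the session listing;
double-check before relying on this):

# ceph fs set slugfs refuse_client_session true
  (blocks new client sessions from being established while recovery runs)
# ceph tell mds.slugfs.pr-md-01.xdtppo client ls
  (lists existing sessions so you can note their client IDs)
# ceph tell mds.slugfs.pr-md-01.xdtppo client evict id=<client-id>
  (evicts a specific existing session, if needed)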
Kind regards,
Sake
On 22-04-2024 20:33 CEST, Erich Weiler <weiler@xxxxxxxxxxxx> wrote:
I also see this from 'ceph health detail':
# ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report oversized cache; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs slugfs is degraded
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.slugfs.pr-md-01.xdtppo(mds.0): MDS cache is too large (19GB/8GB); 0 inodes in use by clients, 0 stray files
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (127084/250) max_segments: 250, num_segments: 127084
MDS cache too large? The MDS process is taking up 22GB right now and is
starting to push my server into swap, so maybe it somehow is too large...
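If the configured limit itself is part of the problem, the cache ceiling is
a runtime config option; a sketch, with 32G as an arbitrary example value
for a host that has RAM to spare:

# ceph config get mds mds_cache_memory_limit
  (shows the current limit; the health warning above suggests it is set to 8G)
# ceph config set mds mds_cache_memory_limit 32G
  (raises the target; note this is a target rather than a hard cap, and the
  MDS can overshoot it during replay, as seen here)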
On 4/22/24 11:17 AM, Erich Weiler wrote:
Hi All,
We have a somewhat serious situation with a CephFS filesystem (18.2.1)
that has 2 active MDSs and one standby. I tried to restart one of the
active daemons to unstick a bunch of blocked requests, and the standby
took over and went into 'replay' for a very long time; then RAM on that
MDS server filled up, it stayed there for a while, and it eventually
appeared to give up and switch to the standby, but the cycle started
again. So I restarted that MDS, and now I'm in a situation where I see
this:
# ceph fs status
slugfs - 29 clients
======
RANK   STATE              MDS              ACTIVITY    DNS    INOS   DIRS   CAPS
 0     replay    slugfs.pr-md-01.xdtppo               3958k  57.1k  12.2k     0
 1     resolve   slugfs.pr-md-02.sbblqq                   0      3      1      0
        POOL            TYPE      USED   AVAIL
  cephfs_metadata      metadata    997G   2948G
 cephfs_md_and_data    data           0   87.6T
    cephfs_data        data        773T    175T
STANDBY MDS
slugfs.pr-md-03.mclckv
MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
It just stays there indefinitely. All my clients are hung. I tried
restarting all MDS daemons and they just went back to this state after
coming back up.
Is there any way I can somehow escape this state of indefinite
replay/resolve?
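One way to at least see whether replay is making progress rather than being
truly stuck (a sketch; with cephadm you may need to run this inside the MDS
container on the host): dump the perf counters on the replaying daemon and
watch the mds_log section, where rdpos should keep climbing toward wrpos.

# ceph daemon mds.slugfs.pr-md-01.xdtppo perf dump
  (look at the mds_log section; repeat the command and compare rdpos between runs)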
Thanks so much! I'm kinda nervous since none of my clients have
filesystem access at the moment...
cheers,
erich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx