Hi,
I have a small cluster with 11 OSDs and 4 filesystems. Each server
(Debian 11, Ceph 17.2.7) usually runs several services.
After trouble with a host holding OSDs, I removed those OSDs and let the
cluster repair itself (3x replication). After a while it returned to a
healthy state and everything was well. This might not be important for
what followed, but I mention it just in case.
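For what it is worth, the removal was done roughly like this (the OSD id
is just an example, I do not remember the exact ones, and it was repeated
for each OSD on the failed host):
  ceph osd out 7
  ceph orch osd rm 7 --zap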
A couple of days later one of the MDS services gave a health warning.
The first one was (2024-05-28 10:02):
mds.cloudfs.stugan6.ywuomz(mds.0): 1 slow requests are blocked > 30 secs
followed by the filesystem being marked degraded (2024-05-28 10:22):
fs cloudfs is degraded
The other filesystems have also been marked degraded from time to time,
but those warnings later cleared.
At 2024-05-28 10:28:
daemon mds.mediafs.stugan7.zzxavs on stugan7 is in error state
daemon mds.cloudfs.stugan7.cmjbun on stugan7 is in error state
At 2024-05-28 10:33:
daemon mds.cloudfs.stugan4.qxwzox on stugan4 is in error state
daemon mds.cloudfs.stugan5.hbkkad on stugan5 is in error state
daemon mds.oxylfs.stugan7.iazekf on stugan7 is in error state
The MDS services kept on crashing...
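(The warnings above are from the cluster health output; to see the
current state I check something like:
  ceph health detail
  ceph orch ps --daemon_type mds
where the second command lists the MDS daemons and what state they are
in.)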
I paused the OSDs and set the nodown, noout, nobackfill, norebalance and
norecover flags; at present only the flags are still set, since I have
been trying to get the system up and running again. While the OSDs were
paused I could clean up the mess and remove all services in error state.
The monitors and managers seem to function well, and I could also get
the MDS services running again. BUT when I removed the pause from the
OSDs, the MDS services once again started to go into error state.
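Roughly the commands involved (the daemon name below is just one of the
daemons in error state, I did the same for each of them):
  ceph osd set pause
  ceph osd set nodown; ceph osd set noout; ceph osd set nobackfill
  ceph osd set norebalance; ceph osd set norecover
  ceph orch daemon rm mds.cloudfs.stugan7.cmjbun --force
  ceph osd unset pause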
For now I have removed the mds label from all Ceph servers, so things
have calmed down. But if I let the MDS services be recreated, the
crashes will start over again.
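The label removal was done with something like this, per host (stugan7
here as one example):
  ceph orch host label rm stugan7 mds
which, with my label-based placement, stops the orchestrator from
scheduling MDS daemons on that host.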
If I check the filesystem status (I have marked the filesystems down for
now), cloudfs looks strange...
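For reference, this is roughly how I marked them down, and the listing
below is the output of 'ceph fs status':
  ceph fs set cloudfs down true    # likewise for oxylfs, mediafs, backupfs
  ceph fs status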
oxylfs - 0 clients
======
POOL TYPE USED AVAIL
oxylfs_metadata metadata 154M 20.8T
oxylfs_data data 1827G 20.8T
cloudfs - 0 clients
=======
RANK  STATE          MDS                      ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  backupfs.stugan6.bgcltx            0    0     0     0
POOL TYPE USED AVAIL
cloudfs_metadata metadata 337M 20.8T
cloudfs_data data 356G 20.8T
mediafs - 0 clients
=======
POOL TYPE USED AVAIL
mediafs_metadata metadata 66.0M 20.8T
mediafs_data data 2465G 20.8T
backupfs - 0 clients
========
POOL TYPE USED AVAIL
backupfsnew_metadata metadata 221M 20.8T
backupfsnew_data data 8740G 20.8T
MDS version: ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
Why is the MDS for the backupfs filesystem (the one in error state) shown
under the cloudfs filesystem?
Now... Is there a way back to normal?
/Johan