MDS crashing

Hi,

I have a small cluster with 11 OSDs and 4 filesystems. Each server (Debian 11, Ceph 17.2.7) usually runs several services.

After trouble with one host's OSDs, I removed those OSDs and let the cluster repair itself (3x replication). After a while it returned to a healthy state and everything was well. This might not be relevant to what followed, but I mention it just in case.

A couple of days later one of the MDS services raised a health warning. The first one was (2024-05-28 10:02):
  mds.cloudfs.stugan6.ywuomz(mds.0): 1 slow requests are blocked > 30 secs

followed by the filesystem being marked degraded (2024-05-28 10:22):
  fs cloudfs is degraded
The other filesystems have also been marked degraded from time to time, but those warnings later cleared.

At 2024-05-28 10:28:
  daemon mds.mediafs.stugan7.zzxavs on stugan7 is in error state
  daemon mds.cloudfs.stugan7.cmjbun on stugan7 is in error state
At 2024-05-28 10:33:
  daemon mds.cloudfs.stugan4.qxwzox on stugan4 is in error state
  daemon mds.cloudfs.stugan5.hbkkad on stugan5 is in error state
  daemon mds.oxylfs.stugan7.iazekf on stugan7 is in error state

The MDS services kept on crashing...

I paused the OSDs and set the nodown, noout, nobackfill, norebalance and norecover flags. At present only the flags remain set, since I have been trying to get the system up and running again.
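For reference, the flags in question map onto the ceph CLI roughly like this (a dry-run sketch that only prints the commands instead of executing them against a live cluster; the pause flag pair has its own subcommands):

```shell
# Dry-run sketch: build and print the ceph CLI calls for the recovery flags
# mentioned above, rather than executing them.
OSD_FLAGS="nodown noout nobackfill norebalance norecover"
CMDS=""
for flag in $OSD_FLAGS; do
  CMDS="${CMDS}ceph osd set ${flag}
"
done
# The pause flags (pauserd/pausewr) are set and cleared with dedicated subcommands:
CMDS="${CMDS}ceph osd pause
ceph osd unpause
"
printf '%s' "$CMDS"
```

Each `ceph osd set <flag>` has a matching `ceph osd unset <flag>` for clearing it again later.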



While the OSDs were paused I could clean up the mess and remove all services in error state. The monitors and managers seem to be functioning well, and I could also get the MDS services running again. BUT, as soon as I unpaused the OSDs, the MDS services started going into error state again.

For now I have removed the mds label from all Ceph servers, so things have calmed down. But if I let the services be recreated, the crashing starts over again.
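Removing the label can be sketched like this (host names taken from this cluster, an orchestrator-managed deployment assumed; the commands are printed here, not executed):

```shell
# Sketch, assuming cephadm/orchestrator-managed MDS placement by label:
# removing the mds label stops the orchestrator from (re)deploying MDS
# daemons on a host; adding it back re-enables placement.
HOSTS="stugan4 stugan5 stugan6 stugan7"
CMDS=""
for host in $HOSTS; do
  CMDS="${CMDS}ceph orch host label rm ${host} mds
"
done
printf '%s' "$CMDS"
```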

If I check the status of the filesystems (I have marked them down for now), cloudfs looks strange...

oxylfs - 0 clients
======
      POOL         TYPE     USED  AVAIL
oxylfs_metadata  metadata   154M  20.8T
  oxylfs_data      data    1827G  20.8T
cloudfs - 0 clients
=======
RANK      STATE                MDS             ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  backupfs.stugan6.bgcltx             0    0     0     0
      POOL          TYPE     USED  AVAIL
cloudfs_metadata  metadata   337M  20.8T
  cloudfs_data      data     356G  20.8T
mediafs - 0 clients
=======
      POOL          TYPE     USED  AVAIL
mediafs_metadata  metadata  66.0M  20.8T
  mediafs_data      data    2465G  20.8T
backupfs - 0 clients
========
        POOL            TYPE     USED  AVAIL
backupfsnew_metadata  metadata   221M  20.8T
  backupfsnew_data      data    8740G  20.8T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
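(For anyone wanting to dig further, a sketch of the inspection commands involved, assuming the standard cephadm/quincy tooling; built as a string and printed rather than executed:)

```shell
# Sketch: commands for inspecting filesystem and crash state on a cluster
# like this one. Printed, not executed, since they need a live cluster.
CHECKS="ceph fs status
ceph fs dump
ceph health detail
ceph crash ls
"
printf '%s' "$CHECKS"
```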


Why is the MDS for the backupfs filesystem (in error state) shown under the cloudfs filesystem?


Now... Is there a way back to normal?

/Johan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


