MDS crashing

Hi,

I have a small cluster with 11 OSDs and 4 filesystems. Each server (Debian 11, Ceph 17.2.7) usually runs several services.

After trouble with one host's OSDs, I removed those OSDs and let the cluster repair itself (3x replication). After a while it returned to a healthy state and everything was well. This might not be relevant to what followed, but I mention it just in case.

A couple of days later one of the MDS services raised a health warning. The first one was (2024-05-28 10:02):
  mds.cloudfs.stugan6.ywuomz(mds.0): 1 slow requests are blocked > 30 secs

followed by the filesystem being marked degraded (2024-05-28 10:22):
  fs cloudfs is degraded
The other filesystems have also been marked degraded from time to time, but those warnings later cleared.

At 2024-05-28 10:28:
  daemon mds.mediafs.stugan7.zzxavs on stugan7 is in error state
  daemon mds.cloudfs.stugan7.cmjbun on stugan7 is in error state
At 2024-05-28 10:33:
  daemon mds.cloudfs.stugan4.qxwzox on stugan4 is in error state
  daemon mds.cloudfs.stugan5.hbkkad on stugan5 is in error state
  daemon mds.oxylfs.stugan7.iazekf on stugan7 is in error state

The MDS services kept on crashing...

I paused the OSDs and set the nodown, noout, nobackfill, norebalance and norecover flags. At present only the flags remain set, since I have been trying to get the system up and running again.
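For reference, the flags in question map onto the ceph CLI roughly like this (a dry-run sketch that only prints the commands instead of executing them against a live cluster; the pause flag pair has its own subcommands):

```shell
# Dry-run sketch: build and print the ceph CLI calls for the recovery flags
# mentioned above, rather than executing them.
OSD_FLAGS="nodown noout nobackfill norebalance norecover"
CMDS=""
for flag in $OSD_FLAGS; do
  CMDS="${CMDS}ceph osd set ${flag}
"
done
# The pause flags (pauserd/pausewr) are set and cleared with dedicated subcommands:
CMDS="${CMDS}ceph osd pause
ceph osd unpause
"
printf '%s' "$CMDS"
```

Each `ceph osd set <flag>` has a matching `ceph osd unset <flag>` for clearing it again later.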



While the OSDs were paused I could clean up the mess and remove all services in error state. The monitors and managers seem to be functioning well, and I could also get the MDS services running again. BUT, as soon as I unpaused the OSDs, the MDS services started going into error state again.

For now I have removed the mds label from all Ceph servers, so things have calmed down. But if I let the services be recreated, the crashing starts over again.
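Removing the label can be sketched like this (host names taken from this cluster, an orchestrator-managed deployment assumed; the commands are printed here, not executed):

```shell
# Sketch, assuming cephadm/orchestrator-managed MDS placement by label:
# removing the mds label stops the orchestrator from (re)deploying MDS
# daemons on a host; adding it back re-enables placement.
HOSTS="stugan4 stugan5 stugan6 stugan7"
CMDS=""
for host in $HOSTS; do
  CMDS="${CMDS}ceph orch host label rm ${host} mds
"
done
printf '%s' "$CMDS"
```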

If I check the status of the filesystems (I have marked them down for now), cloudfs looks strange...

oxylfs - 0 clients
======
      POOL         TYPE     USED  AVAIL
oxylfs_metadata  metadata   154M  20.8T
  oxylfs_data      data    1827G  20.8T
cloudfs - 0 clients
=======
RANK      STATE                MDS             ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  backupfs.stugan6.bgcltx             0    0     0     0
      POOL          TYPE     USED  AVAIL
cloudfs_metadata  metadata   337M  20.8T
  cloudfs_data      data     356G  20.8T
mediafs - 0 clients
=======
      POOL          TYPE     USED  AVAIL
mediafs_metadata  metadata  66.0M  20.8T
  mediafs_data      data    2465G  20.8T
backupfs - 0 clients
========
        POOL            TYPE     USED  AVAIL
backupfsnew_metadata  metadata   221M  20.8T
  backupfsnew_data      data    8740G  20.8T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
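(For anyone wanting to dig further, a sketch of the inspection commands involved, assuming the standard cephadm/quincy tooling; built as a string and printed rather than executed:)

```shell
# Sketch: commands for inspecting filesystem and crash state on a cluster
# like this one. Printed, not executed, since they need a live cluster.
CHECKS="ceph fs status
ceph fs dump
ceph health detail
ceph crash ls
"
printf '%s' "$CHECKS"
```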


Why is the MDS for the backupfs filesystem (in error state) shown under the cloudfs filesystem?


Now... Is there a way back to normal?

/Johan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


