Re: MDS daemons stuck in resolve, please help

The MDS cluster came back up again, but I lost a number of standby MDS daemons. I cleared the OSD blacklist, but they do not show up as standby daemons again. The daemons themselves are running but do not seem to rejoin the cluster. The log shows:

2021-08-30 21:32:34.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:39.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:44.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:49.897 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
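
To see how often a daemon reports this, the timeout lines can be tallied per component. The following is a minimal sketch, not an official Ceph tool: the regular expression is inferred only from the four log lines quoted above, and the names `HEARTBEAT_RE`, `count_heartbeat_timeouts` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this message (assumption):
# "<date> <time> <thread id>  <level> heartbeat_map is_healthy '<name>' had timed out after <n>"
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+\d+ "
    r"heartbeat_map is_healthy '(?P<name>[^']+)' had timed out after (?P<secs>\d+)"
)


def count_heartbeat_timeouts(lines):
    """Return a Counter mapping component name to its number of timeout lines."""
    counts = Counter()
    for line in lines:
        match = HEARTBEAT_RE.match(line.strip())
        if match:
            counts[match.group("name")] += 1
    return counts


# The four lines from the log excerpt above.
SAMPLE = """\
2021-08-30 21:32:34.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:39.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:44.896 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2021-08-30 21:32:49.897 7fc9e22f8700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
"""

if __name__ == "__main__":
    for name, count in count_heartbeat_timeouts(SAMPLE.splitlines()).items():
        print(f"{name}: {count} timeout line(s)")
```

In practice the lines would come from the MDS log file of the affected daemon instead of `SAMPLE`; the path of that file depends on the installation and is not given here.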

I just had another frenzy of MDS fail-overs and am running out of standby daemons. Restarting a "missing" daemon brings it back to life, but I would prefer this to work by itself. Any ideas on what's going on are welcome.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 30 August 2021 21:12:53
To: ceph-users
Subject:  MDS daemons stuck in resolve, please help

Hi all,

Our MDS cluster became degraded after one MDS crashed with an oversized cache. Other MDS daemons followed suit, and now they are stuck in this state:

[root@gnosis ~]# ceph fs status
con-fs2 - 1640 clients
=======
+------+---------+---------+---------------+-------+-------+
| Rank |  State  |   MDS   |    Activity   |  dns  |  inos |
+------+---------+---------+---------------+-------+-------+
|  0   | resolve | ceph-24 |               | 22.1k | 22.0k |
|  1   | resolve | ceph-13 |               |  769k |  758k |
|  2   |  active | ceph-16 | Reqs:    0 /s |  255k |  255k |
|  3   | resolve | ceph-09 |               | 5624  | 5619  |
+------+---------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata | 1828M | 1767G |
|    con-fs2-meta2    |   data   |    0  | 1767G |
|     con-fs2-data    |   data   | 1363T | 6049T |
| con-fs2-data-ec-ssd |   data   |  239G | 4241G |
|    con-fs2-data2    |   data   | 10.2T | 5499T |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph-12   |
|   ceph-08   |
|   ceph-23   |
|   ceph-11   |
+-------------+
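
For watching whether the ranks leave the resolve state, the rank table above can be grouped by state. This is a rough sketch that assumes the plain-text table layout shown in this message (other Ceph releases may print it differently); `ranks_by_state` and `SAMPLE_STATUS` are hypothetical names, and the sample is a shortened copy of the output above.

```python
def ranks_by_state(fs_status_text):
    """Map each MDS state to a list of (rank, daemon) pairs from the rank table."""
    result = {}
    in_rank_table = False
    for line in fs_status_text.splitlines():
        if not line.startswith("|"):
            continue  # skip borders, the title and blank lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if cells[0] == "Rank":
            in_rank_table = True  # header row of the rank table
            continue
        if cells[0] in ("Pool", "Standby MDS"):
            in_rank_table = False  # header row of a later table
            continue
        if in_rank_table and len(cells) >= 3:
            rank, state, daemon = int(cells[0]), cells[1], cells[2]
            result.setdefault(state, []).append((rank, daemon))
    return result


# Shortened copy of the `ceph fs status` output quoted above.
SAMPLE_STATUS = """\
+------+---------+---------+---------------+-------+-------+
| Rank |  State  |   MDS   |    Activity   |  dns  |  inos |
+------+---------+---------+---------------+-------+-------+
|  0   | resolve | ceph-24 |               | 22.1k | 22.0k |
|  1   | resolve | ceph-13 |               |  769k |  758k |
|  2   |  active | ceph-16 | Reqs:    0 /s |  255k |  255k |
|  3   | resolve | ceph-09 |               | 5624  | 5619  |
+------+---------+---------+---------------+-------+-------+
+---------------------+----------+-------+-------+
|         Pool        |   type   |  used | avail |
+---------------------+----------+-------+-------+
|    con-fs2-meta1    | metadata | 1828M | 1767G |
+---------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph-12   |
+-------------+
"""

if __name__ == "__main__":
    for state, ranks in ranks_by_state(SAMPLE_STATUS).items():
        print(state, ranks)
```

Here three of the four ranks (0, 1 and 3) are in resolve and only rank 2 is active.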

I tried to set max_mds to 1 to no avail. How can I get the MDS daemons back up?

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx