Thanks again. Now my CephFS is back online! I ended up building ceph-mon from source myself, with the following patch applied, and replacing only the mon leader seems to be sufficient (a rough sketch of the build-and-swap procedure is at the end of this message). Now I'm interested in why such a routine automated minor-version upgrade could get the cluster into such a state in the first place.

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4373938..786f227 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -1526,7 +1526,7 @@ int MDSMonitor::filesystem_command(
       ss << "removed mds gid " << gid;
       return 0;
     }
-  } else if (prefix == "mds rmfailed") {
+  } else if (prefix == "mds addfailed") {
     bool confirm = false;
     cmd_getval(cmdmap, "yes_i_really_mean_it", confirm);
     if (!confirm) {
@@ -1554,10 +1554,10 @@ int MDSMonitor::filesystem_command(
         role.fscid,
         [role](std::shared_ptr<Filesystem> fs)
     {
-      fs->mds_map.failed.erase(role.rank);
+      fs->mds_map.failed.insert(role.rank);
     });
 
-    ss << "removed failed mds." << role;
+    ss << "added failed mds." << role;
     return 0;
     /* TODO: convert to fs commands to update defaults */
   } else if (prefix == "mds compat rm_compat") {
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 463419b..5c6a927 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -334,7 +334,7 @@ COMMAND("mds repaired name=role,type=CephString",
 COMMAND("mds rm "
     "name=gid,type=CephInt,range=0",
     "remove nonactive mds", "mds", "rw")
-COMMAND_WITH_FLAG("mds rmfailed name=role,type=CephString "
+COMMAND_WITH_FLAG("mds addfailed name=role,type=CephString "
     "name=yes_i_really_mean_it,type=CephBool,req=false",
     "remove failed rank", "mds", "rw", FLAG(HIDDEN))
 COMMAND_WITH_FLAG("mds cluster_down", "take MDS cluster down", "mds", "rw", FLAG(OBSOLETE))

From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: September 18, 2021 5:06
To: 胡 玮文 <huww98@xxxxxxxxxxx>
Cc: Eric Dold <dold.eric@xxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Cephfs - MDS all up:standby, not becoming up:active

On Fri, Sep 17, 2021 at 3:17 PM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:
>
> > Did you run the command I suggested before or after you executed `rmfailed` below?
>
> I ran "rmfailed" before reading your mail. Then the MONs crashed. I fixed the crash by setting max_mds=2. Then I tried the command you suggested.
>
> By reading the code[1], I think I really need to undo the "rmfailed" to get my MDS out of the standby state.

Exactly. If you install the repositories from (available in about 1 hour):

https://shaman.ceph.com/repos/ceph/ceph-mds-addfailed-pacific/9a1ccf41c32446e1b31328e7d01ea8e4aaea8cbb/

for the monitors (only), and then run:

for i in 0 1; do ceph mds addfailed <fs_name>:$i --yes-i-really-mean-it ; done

it should fix it for you.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
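
P.S. For anyone else who ends up in this situation: the build-and-swap step went roughly like the sketch below. The release tag, patch file name, install path, and systemd unit name are illustrative placeholders rather than exact values; they depend on how the cluster was deployed (containerized deployments in particular will differ).

# Sketch only: tag, paths, and service names depend on your environment.
git clone https://github.com/ceph/ceph.git && cd ceph
git checkout v16.2.6                        # placeholder: the tag matching the running cluster
git apply mds-addfailed.patch               # the patch above, saved to a local file
./install-deps.sh
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cd build && ninja ceph-mon                  # build only the monitor binary

# Identify the current mon leader, then swap the binary on that host only.
ceph mon stat                               # reports the current leader
systemctl stop ceph-mon@<leader-host>       # run on the leader host
cp bin/ceph-mon /usr/bin/ceph-mon           # install path depends on your packaging
systemctl start ceph-mon@<leader-host>

# With the patched leader back in quorum, re-add the failed ranks as Patrick described:
for i in 0 1; do ceph mds addfailed <fs_name>:$i --yes-i-really-mean-it ; done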