Thanks again. Now my CephFS is back online! I ended up building ceph-mon from source myself, with the following patch applied, and replacing only the mon leader seems to be sufficient (a rough sketch of the build-and-swap procedure is at the end of this message). Now I'm interested in why such a routine automated minor-version upgrade could get the cluster into such a state in the first place.

diff --git a/src/mon/MDSMonitor.cc b/src/mon/MDSMonitor.cc
index 4373938..786f227 100644
--- a/src/mon/MDSMonitor.cc
+++ b/src/mon/MDSMonitor.cc
@@ -1526,7 +1526,7 @@ int MDSMonitor::filesystem_command(
       ss << "removed mds gid " << gid;
       return 0;
     }
-  } else if (prefix == "mds rmfailed") {
+  } else if (prefix == "mds addfailed") {
     bool confirm = false;
     cmd_getval(cmdmap, "yes_i_really_mean_it", confirm);
     if (!confirm) {
@@ -1554,10 +1554,10 @@ int MDSMonitor::filesystem_command(
         role.fscid,
         [role](std::shared_ptr<Filesystem> fs)
     {
-      fs->mds_map.failed.erase(role.rank);
+      fs->mds_map.failed.insert(role.rank);
     });
 
-    ss << "removed failed mds." << role;
+    ss << "added failed mds." << role;
     return 0;
     /* TODO: convert to fs commands to update defaults */
   } else if (prefix == "mds compat rm_compat") {
diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 463419b..5c6a927 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -334,7 +334,7 @@ COMMAND("mds repaired name=role,type=CephString",
 COMMAND("mds rm "
     "name=gid,type=CephInt,range=0",
     "remove nonactive mds", "mds", "rw")
-COMMAND_WITH_FLAG("mds rmfailed name=role,type=CephString "
+COMMAND_WITH_FLAG("mds addfailed name=role,type=CephString "
     "name=yes_i_really_mean_it,type=CephBool,req=false",
     "remove failed rank", "mds", "rw", FLAG(HIDDEN))
 COMMAND_WITH_FLAG("mds cluster_down", "take MDS cluster down", "mds", "rw", FLAG(OBSOLETE))

From: Patrick Donnelly <pdonnell@xxxxxxxxxx>
Sent: September 18, 2021 5:06
To: 胡 玮文 <huww98@xxxxxxxxxxx>
Cc: Eric Dold <dold.eric@xxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
Subject: Re: Cephfs - MDS all up:standby, not becoming up:active

On Fri, Sep 17, 2021 at 3:17 PM 胡 玮文 <huww98@xxxxxxxxxxx> wrote:
>
> > Did you run the command I suggested before or after you executed `rmfailed` below?
>
> I ran "rmfailed" before reading your mail. Then the MONs crashed. I fixed the crash by setting max_mds=2. Then I tried the command you suggested.
>
> By reading the code[1], I think I really need to undo the "rmfailed" to get my MDS out of the standby state.

Exactly. If you install the repositories from (available in about 1 hour):

https://shaman.ceph.com/repos/ceph/ceph-mds-addfailed-pacific/9a1ccf41c32446e1b31328e7d01ea8e4aaea8cbb/

for the monitors (only), and then run:

for i in 0 1; do ceph mds addfailed <fs_name>:$i --yes-i-really-mean-it ; done

it should fix it for you.

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
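
P.S. For anyone else who ends up in this situation: the build-and-swap step went roughly like the sketch below. The release tag, patch file name, install path, and systemd unit name are illustrative placeholders rather than exact values; they depend on how the cluster was deployed (containerized deployments in particular will differ).

# Sketch only: tag, paths, and service names depend on your environment.
git clone https://github.com/ceph/ceph.git && cd ceph
git checkout v16.2.6                        # placeholder: the tag matching the running cluster
git apply mds-addfailed.patch               # the patch above, saved to a local file
./install-deps.sh
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cd build && ninja ceph-mon                  # build only the monitor binary

# Identify the current mon leader, then swap the binary on that host only.
ceph mon stat                               # reports the current leader
systemctl stop ceph-mon@<leader-host>       # run on the leader host
cp bin/ceph-mon /usr/bin/ceph-mon           # install path depends on your packaging
systemctl start ceph-mon@<leader-host>

# With the patched leader back in quorum, re-add the failed ranks as Patrick described:
for i in 0 1; do ceph mds addfailed <fs_name>:$i --yes-i-really-mean-it ; done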