Hi Paul, On Fri, 11 Jun 2010, Paul wrote: > Hiding the details of the id number sequence from the user seems like > a great idea. > > Havn't seen any serious problems while running the mon-remove branch. > A few minor ones: > 1. the mkcephfs and init scripts need a bit of tweaking to parse the > new config format. Also, "mon0" used to have a special meaning in > mkcephfs... going to need an equivalent in the new scheme. > 2. monmaptool could use an update to it's usage string Fixed these. > 3. mon add needs to return a reply immediately as well since the rank > might also change during an add This was actually a problem with how the request forwarding/proxying was happening. I simplified that code a bit (using the MonSession now to track proxy info instead of hidden fields in the message) and it seems to be working just fine. > Starting the cluster using the old config format doesn't work - the > other daemons complain that they can't find any monitors. I think this was just monclient's parsing of the ceph.conf.. I made the '.' optional (mon0 or mon.0 or monfoo or mon.foo all ok). Pushed to mon-remove on ceph.git.. let me know if you have any problems. Not sure if we should merge it for v0.21 or not. Depends on how much testing we'll do in the next couple weeks. sage > > > Paul C > > On Wed, Jun 9, 2010 at 2:56 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > > On Tue, 8 Jun 2010, Paul wrote: > >> Could you take a look at the commits at http://github.com/tcloud/ceph > >> on the mon-remove branch please. > > > > It seems to work pretty well. I changed the LogClient interaction to be > > less annoying, simplified the mon removal check, and did a few small > > cleanups. I pushed a mon-remove branch to > > git://ceph.newdream.net/git/ceph.git. > > > > The one remaining issue I see is the way that monitors are > > identified/named. We dealt with a similar issue with the MDSs a while > > back. The problem is that we need to identify a particular instance of > > cmon somehow (in ceph.conf, and at startup) so that cmon knows which > > address to bind to. But if you have 3 monitors, identified as 0, 1, 2, > > and remove 1, then 2 becomes 1. We fixed things up so that the running > > instance adjusts it's rank, but the ceph.conf is still wrong. > > > > We could take a similar route as we did with the mds, and identify each > > cmon instance by name and not just by number. Then the ceph.conf entry > > would look like > > > > [mon.foo] > > ... > > > > and on startup you'd run > > > > $ cmon -i foo # and not 'cmon -i 0' > > > > That would mean adjusting the monmap to store the names of monitors. The > > numeric 'rank' would just the current position of a monitor during a > > particular epoch. > > > > The result would be more like > > > > $ ceph mon add foo 1.2.3.4:6789 > > $ ceph mon remove bar > > or > > $ ceph mon remove 1.2.3.6:6789 > > > > Alternatively, we could keep the monmap the same, and allow ranks to be > > empty, so that we could have say mon 0, 2, and 3. That would require some > > MonClient changes to the code that picks a new monitor, and to any other > > code that might assume that monmap.size() determines the range of existing > > monitors. > > > > I think I prefer the first approach, because it is easier for people to > > identify specific instances in terms of a name (not a number with specific > > constraints). And because the MonMap encoding could be simplified anyway > > (no reason for it to be an entity_inst_t vector instead of an > > entity_addr_t vector). > > > > sage > > > > > > > >> > >> The second case was partly due to the MonClient trying to contact the > >> monitor in the new map that has the same id (which is now mapped to a > >> different address) as the one it was talking to before. Added a patch > >> for this. > >> > >> Also, about the AuthMon issue we were talking about earlier: I wasn't > >> able to reproduce it afterwards. Hopefully the root cause is the same > >> as above. > >> > >> Thanks, > >> Paul C > >> > >> On Tue, Jun 8, 2010 at 1:38 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > >> > On Mon, 7 Jun 2010, Paul wrote: > >> >> Took a shot at implementing the remove mon command, turned out to be > >> >> not _quite_ as trivial as I'd expected. > >> >> > >> >> The monitors need to adjust their whoami values after getting the new > >> >> map, which is annoying since the value is stored in quite a few > >> >> places. Ended up removing all the direct copies I could find and had > >> >> them lookup the value in Monitor instead. > >> > > >> > Looks good. > >> > > >> >> The interaction with MonClient (i.e. via ceph command) needs a bit > >> >> more work, but I'm not really sure how to proceed at this point. If > >> >> the client happens to pick the monitor that is to be removed to send > >> >> the remove command to, it'll retry with another monitor and get a > >> >> "does not exist" error (because the first operation already removed > >> >> it). If the client happens to pick a monitor whose id is greater than > >> >> the one to be removed, it will never get an response to the command. > >> > > >> > Hmm. If we can get the reply sent before the monmap change is broadcast, > >> > the first case should be avoidable. I'm not sure what would case the > >> > second case. > >> > > >> > The patches below are mangled (wordwrapped) by gmail. Can you resend them > >> > using 'git send-email'? Or if they're in a public git tree somewhere I > >> > can grab them from there. > >> > > >> > Thanks! > >> > sage > >> > > >> > > >> >> > >> >> >From 1d30deea10266392328c6fbdf176b766db42c32f Mon Sep 17 00:00:00 2001 > >> >> From: Paul Chiang <paul_chiang@xxxxxxxxxxxxxxxxxxx> > >> >> Date: Mon, 7 Jun 2010 09:47:48 +0800 > >> >> Subject: [PATCH 1/2] Removed all copies of the whoami value > >> >> > >> >> > >> >> Signed-off-by: Paul Chiang <paul_chiang@xxxxxxxxxxxxxxxxxxx> > >> >> --- > >> >> src/mon/Elector.cc | 10 +++++----- > >> >> src/mon/Elector.h | 3 +-- > >> >> src/mon/Monitor.cc | 4 ++-- > >> >> src/mon/Paxos.cc | 16 ++++++++-------- > >> >> src/mon/Paxos.h | 5 ++--- > >> >> 5 files changed, 18 insertions(+), 20 deletions(-) > >> >> > >> >> diff --git a/src/mon/Elector.cc b/src/mon/Elector.cc > >> >> index 7f8fa62..41fde6e 100644 > >> >> --- a/src/mon/Elector.cc > >> >> +++ b/src/mon/Elector.cc > >> >> @@ -69,11 +69,11 @@ void Elector::start() > >> >> bump_epoch(epoch+1); // odd == election cycle > >> >> start_stamp = g_clock.now(); > >> >> electing_me = true; > >> >> - acked_me.insert(whoami); > >> >> + acked_me.insert(mon->whoami); > >> >> > >> >> // bcast to everyone else > >> >> for (unsigned i=0; i<mon->monmap->size(); ++i) { > >> >> - if ((int)i == whoami) continue; > >> >> + if ((int)i == mon->whoami) continue; > >> >> mon->messenger->send_message(new > >> >> MMonElection(MMonElection::OP_PROPOSE, epoch, mon->monmap), > >> >> mon->monmap->get_inst(i)); > >> >> } > >> >> @@ -151,7 +151,7 @@ void Elector::victory() > >> >> for (set<int>::iterator p = quorum.begin(); > >> >> p != quorum.end(); > >> >> ++p) { > >> >> - if (*p == whoami) continue; > >> >> + if (*p == mon->whoami) continue; > >> >> MMonElection *m = new MMonElection(MMonElection::OP_VICTORY, > >> >> epoch, mon->monmap); > >> >> m->quorum = quorum; > >> >> mon->messenger->send_message(m, mon->monmap->get_inst(*p)); > >> >> @@ -186,7 +186,7 @@ void Elector::handle_propose(MMonElection *m) > >> >> } > >> >> } > >> >> > >> >> - if (whoami < from) { > >> >> + if (mon->whoami < from) { > >> >> // i would win over them. > >> >> if (leader_acked >= 0) { // we already acked someone > >> >> assert(leader_acked < from); // and they still win, of course > >> >> @@ -250,7 +250,7 @@ void Elector::handle_victory(MMonElection *m) > >> >> dout(5) << "handle_victory from " << m->get_source() << dendl; > >> >> int from = m->get_source().num(); > >> >> > >> >> - assert(from < whoami); > >> >> + assert(from < mon->whoami); > >> >> assert(m->epoch % 2 == 0); > >> >> > >> >> // i should have seen this election if i'm getting the victory. > >> >> diff --git a/src/mon/Elector.h b/src/mon/Elector.h > >> >> index 9bfd7cb..8a21e09 100644 > >> >> --- a/src/mon/Elector.h > >> >> +++ b/src/mon/Elector.h > >> >> @@ -32,7 +32,6 @@ class Monitor; > >> >> class Elector { > >> >> private: > >> >> Monitor *mon; > >> >> - int whoami; > >> >> > >> >> Context *expire_event; > >> >> > >> >> @@ -71,7 +70,7 @@ class Elector { > >> >> void handle_victory(class MMonElection *m); > >> >> > >> >> public: > >> >> - Elector(Monitor *m, int w) : mon(m), whoami(w), > >> >> + Elector(Monitor *m) : mon(m), > >> >> expire_event(0), > >> >> epoch(0), > >> >> electing_me(false), > >> >> diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc > >> >> index cf0b20b..db086be 100644 > >> >> --- a/src/mon/Monitor.cc > >> >> +++ b/src/mon/Monitor.cc > >> >> @@ -93,7 +93,7 @@ Monitor::Monitor(int w, MonitorStore *s, Messenger > >> >> *m, MonMap *map) : > >> >> > >> >> state(STATE_STARTING), stopping(false), > >> >> > >> >> - elector(this, w), > >> >> + elector(this), > >> >> mon_epoch(0), > >> >> leader(0), > >> >> paxos(PAXOS_NUM), paxos_service(PAXOS_NUM), > >> >> @@ -114,7 +114,7 @@ Monitor::Monitor(int w, MonitorStore *s, Messenger > >> >> *m, MonMap *map) : > >> >> > >> >> Paxos *Monitor::add_paxos(int type) > >> >> { > >> >> - Paxos *p = new Paxos(this, whoami, type); > >> >> + Paxos *p = new Paxos(this, type); > >> >> paxos[type] = p; > >> >> return p; > >> >> } > >> >> diff --git a/src/mon/Paxos.cc b/src/mon/Paxos.cc > >> >> index d10f31c..9c247e8 100644 > >> >> --- a/src/mon/Paxos.cc > >> >> +++ b/src/mon/Paxos.cc > >> >> @@ -23,7 +23,7 @@ > >> >> > >> >> #define DOUT_SUBSYS paxos > >> >> #undef dout_prefix > >> >> -#define dout_prefix _prefix(mon, whoami, machine_name, state, last_committed) > >> >> +#define dout_prefix _prefix(mon, mon->whoami, machine_name, state, > >> >> last_committed) > >> >> static ostream& _prefix(Monitor *mon, int whoami, const char > >> >> *machine_name, int state, version_t last_committed) { > >> >> return *_dout << dbeginl > >> >> << "mon" << whoami > >> >> @@ -87,7 +87,7 @@ void Paxos::collect(version_t oldpn) > >> >> for (set<int>::const_iterator p = mon->get_quorum().begin(); > >> >> p != mon->get_quorum().end(); > >> >> ++p) { > >> >> - if (*p == whoami) continue; > >> >> + if (*p == mon->whoami) continue; > >> >> > >> >> MMonPaxos *collect = new MMonPaxos(mon->get_epoch(), > >> >> MMonPaxos::OP_COLLECT, machine_id); > >> >> collect->last_committed = last_committed; > >> >> @@ -336,7 +336,7 @@ void Paxos::begin(bufferlist& v) > >> >> > >> >> // accept it ourselves > >> >> accepted.clear(); > >> >> - accepted.insert(whoami); > >> >> + accepted.insert(mon->whoami); > >> >> new_value = v; > >> >> mon->store->put_bl_sn(new_value, machine_name, last_committed+1); > >> >> > >> >> @@ -356,7 +356,7 @@ void Paxos::begin(bufferlist& v) > >> >> for (set<int>::const_iterator p = mon->get_quorum().begin(); > >> >> p != mon->get_quorum().end(); > >> >> ++p) { > >> >> - if (*p == whoami) continue; > >> >> + if (*p == mon->whoami) continue; > >> >> > >> >> dout(10) << " sending begin to mon" << *p << dendl; > >> >> MMonPaxos *begin = new MMonPaxos(mon->get_epoch(), > >> >> MMonPaxos::OP_BEGIN, machine_id); > >> >> @@ -483,7 +483,7 @@ void Paxos::commit() > >> >> for (set<int>::const_iterator p = mon->get_quorum().begin(); > >> >> p != mon->get_quorum().end(); > >> >> ++p) { > >> >> - if (*p == whoami) continue; > >> >> + if (*p == mon->whoami) continue; > >> >> > >> >> dout(10) << " sending commit to mon" << *p << dendl; > >> >> MMonPaxos *commit = new MMonPaxos(mon->get_epoch(), > >> >> MMonPaxos::OP_COMMIT, machine_id); > >> >> @@ -527,7 +527,7 @@ void Paxos::extend_lease() > >> >> lease_expire = g_clock.now(); > >> >> lease_expire += g_conf.mon_lease; > >> >> acked_lease.clear(); > >> >> - acked_lease.insert(whoami); > >> >> + acked_lease.insert(mon->whoami); > >> >> > >> >> dout(7) << "extend_lease now+" << g_conf.mon_lease << " (" << > >> >> lease_expire << ")" << dendl; > >> >> > >> >> @@ -535,7 +535,7 @@ void Paxos::extend_lease() > >> >> for (set<int>::const_iterator p = mon->get_quorum().begin(); > >> >> p != mon->get_quorum().end(); > >> >> ++p) { > >> >> - if (*p == whoami) continue; > >> >> + if (*p == mon->whoami) continue; > >> >> MMonPaxos *lease = new MMonPaxos(mon->get_epoch(), > >> >> MMonPaxos::OP_LEASE, machine_id); > >> >> lease->last_committed = last_committed; > >> >> lease->lease_timestamp = lease_expire; > >> >> @@ -724,7 +724,7 @@ version_t Paxos::get_new_proposal_number(version_t gt) > >> >> last_pn /= 100; > >> >> last_pn++; > >> >> last_pn *= 100; > >> >> - last_pn += (version_t)whoami; > >> >> + last_pn += (version_t)mon->whoami; > >> >> > >> >> // write > >> >> mon->store->put_int(last_pn, machine_name, "last_pn"); > >> >> diff --git a/src/mon/Paxos.h b/src/mon/Paxos.h > >> >> index c6eff22..9d34605 100644 > >> >> --- a/src/mon/Paxos.h > >> >> +++ b/src/mon/Paxos.h > >> >> @@ -68,7 +68,6 @@ class Paxos; > >> >> // i am one state machine. > >> >> class Paxos { > >> >> Monitor *mon; > >> >> - int whoami; > >> >> > >> >> // my state machine info > >> >> int machine_id; > >> >> @@ -225,8 +224,8 @@ private: > >> >> version_t get_new_proposal_number(version_t gt=0); > >> >> > >> >> public: > >> >> - Paxos(Monitor *m, int w, > >> >> - int mid) : mon(m), whoami(w), > >> >> + Paxos(Monitor *m, > >> >> + int mid) : mon(m), > >> >> machine_id(mid), > >> >> machine_name(get_paxos_name(mid)), > >> >> state(STATE_RECOVERING), > >> >> -- > >> >> 1.6.4.2 > >> >> > >> >> >From eb7308104c33aa50195e3736483c6bf9a145a518 Mon Sep 17 00:00:00 2001 > >> >> From: Paul Chiang <paul_chiang@xxxxxxxxxxxxxxxxxxx> > >> >> Date: Mon, 7 Jun 2010 10:16:39 +0800 > >> >> Subject: [PATCH 2/2] Introduced ceph mon remove command > >> >> > >> >> > >> >> Signed-off-by: Paul Chiang <paul_chiang@xxxxxxxxxxxxxxxxxxx> > >> >> --- > >> >> src/common/LogClient.h | 1 + > >> >> src/mon/Monitor.cc | 1 + > >> >> src/mon/Monitor.h | 1 + > >> >> src/mon/MonmapMonitor.cc | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > >> >> src/mon/MonmapMonitor.h | 1 + > >> >> 5 files changed, 65 insertions(+), 0 deletions(-) > >> >> > >> >> diff --git a/src/common/LogClient.h b/src/common/LogClient.h > >> >> index 0185eec..bc5e227 100644 > >> >> --- a/src/common/LogClient.h > >> >> +++ b/src/common/LogClient.h > >> >> @@ -56,6 +56,7 @@ class LogClient : public Dispatcher { > >> >> void send_log(); > >> >> void handle_log_ack(MLogAck *m); > >> >> void set_synchronous(bool sync) { is_synchronous = sync; } > >> >> + void set_mon(int mon_id) { mon = mon_id; } > >> >> > >> >> LogClient(Messenger *m, MonMap *mm) : > >> >> messenger(m), monmap(mm), mon(-1), is_synchronous(false), > >> >> diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc > >> >> index db086be..259b6ce 100644 > >> >> --- a/src/mon/Monitor.cc > >> >> +++ b/src/mon/Monitor.cc > >> >> @@ -110,6 +110,7 @@ Monitor::Monitor(int w, MonitorStore *s, Messenger > >> >> *m, MonMap *map) : > >> >> mon_caps = new MonCaps(); > >> >> mon_caps->set_allow_all(true); > >> >> mon_caps->text = "allow *"; > >> >> + myaddr = map->get_inst(w).addr; > >> >> } > >> >> > >> >> Paxos *Monitor::add_paxos(int type) > >> >> diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h > >> >> index 829c506..a673972 100644 > >> >> --- a/src/mon/Monitor.h > >> >> +++ b/src/mon/Monitor.h > >> >> @@ -58,6 +58,7 @@ class Monitor : public Dispatcher { > >> >> public: > >> >> // me > >> >> int whoami; > >> >> + entity_addr_t myaddr; > >> >> Messenger *messenger; > >> >> Mutex lock; > >> >> > >> >> diff --git a/src/mon/MonmapMonitor.cc b/src/mon/MonmapMonitor.cc > >> >> index 0aca5fb..5368ab7 100644 > >> >> --- a/src/mon/MonmapMonitor.cc > >> >> +++ b/src/mon/MonmapMonitor.cc > >> >> @@ -48,6 +48,7 @@ bool MonmapMonitor::update_from_paxos() > >> >> dout(10) << "update_from_paxos paxosv " << paxosv > >> >> << ", my v " << mon->monmap->epoch << dendl; > >> >> > >> >> + int original_map_size = mon->monmap->size(); > >> >> //read and decode > >> >> monmap_bl.clear(); > >> >> bool success = paxos->read(paxosv, monmap_bl); > >> >> @@ -58,6 +59,20 @@ bool MonmapMonitor::update_from_paxos() > >> >> //save the bufferlist version in the paxos instance as well > >> >> paxos->stash_latest(paxosv, monmap_bl); > >> >> > >> >> + if (original_map_size != mon->monmap->size()) > >> >> + { > >> >> + _update_whoami(); > >> >> + > >> >> + // call election? > >> >> + if (mon->monmap->size() > 1) { > >> >> + mon->call_election(); > >> >> + } else { > >> >> + // we're standalone. > >> >> + set<int> q; > >> >> + q.insert(mon->whoami); > >> >> + mon->win_election(1, q); > >> >> + } > >> >> + } > >> >> return true; > >> >> } > >> >> > >> >> @@ -127,6 +142,8 @@ bool MonmapMonitor::preprocess_command(MMonCommand *m) > >> >> } > >> >> else if (m->cmd[1] == "add") > >> >> return false; > >> >> + else if (m->cmd[1] == "remove") > >> >> + return false; > >> >> } > >> >> > >> >> if (r != -1) { > >> >> @@ -177,6 +194,23 @@ bool MonmapMonitor::prepare_command(MMonCommand *m) > >> >> paxos->wait_for_commit(new Monitor::C_Command(mon, m, 0, rs, > >> >> paxos->get_version())); > >> >> return true; > >> >> } > >> >> + else if (m->cmd.size() == 3 && m->cmd[1] == "remove") { > >> >> + entity_addr_t addr; > >> >> + parse_ip_port(m->cmd[2].c_str(), addr); > >> >> + bufferlist rdata; > >> >> + if (!pending_map.contains(addr)) { > >> >> + err = -ENOENT; > >> >> + ss << "mon " << addr << " does not exist"; > >> >> + goto out; > >> >> + } > >> >> + > >> >> + pending_map.remove(addr); > >> >> + pending_map.last_changed = g_clock.now(); > >> >> + ss << "removed mon" << " at " << addr << ", there are now " << > >> >> pending_map.size() << " monitors" ; > >> >> + getline(ss, rs); > >> >> + paxos->wait_for_commit(new Monitor::C_Command(mon, m, 0, rs, > >> >> paxos->get_version())); > >> >> + return true; > >> >> + } > >> >> else > >> >> ss << "unknown command " << m->cmd[1]; > >> >> } else > >> >> @@ -203,3 +237,30 @@ void MonmapMonitor::tick() > >> >> { > >> >> update_from_paxos(); > >> >> } > >> >> + > >> >> +void MonmapMonitor::_update_whoami() > >> >> +{ > >> >> + // first check if there is any change > >> >> + if (mon->whoami < mon->monmap->size() && > >> >> + mon->monmap->get_inst(mon->whoami).addr == mon->myaddr) > >> >> + { > >> >> + return; > >> >> + } > >> >> + > >> >> + // then check backwards starting from min(whoami-1, size-1) since > >> >> whoami only ever decreases > >> >> + unsigned i=(mon->whoami-1) < mon->monmap->size() ? > >> >> (mon->whoami-1):(mon->monmap->size()-1); > >> >> + for (; i>=0; i--) > >> >> + { > >> >> + if (mon->monmap->get_inst(i).addr == mon->myaddr) > >> >> + { > >> >> + dout(10) << "Changing whoami from " << mon->whoami << " to " << > >> >> i << dendl; > >> >> + mon->whoami = i; > >> >> + mon->messenger->set_myname(entity_name_t::MON(i)); > >> >> + mon->logclient.set_mon(i); > >> >> + return; > >> >> + } > >> >> + } > >> >> + dout(0) << "Cannot find myself (mon" << mon->whoami << ", " << > >> >> mon->myaddr << ") in new monmap! I must have been removed, shutting > >> >> down." << dendl; > >> >> + mon->shutdown(); > >> >> +} > >> >> + > >> >> diff --git a/src/mon/MonmapMonitor.h b/src/mon/MonmapMonitor.h > >> >> index fbb7fa5..b14c012 100644 > >> >> --- a/src/mon/MonmapMonitor.h > >> >> +++ b/src/mon/MonmapMonitor.h > >> >> @@ -67,6 +67,7 @@ class MonmapMonitor : public PaxosService { > >> >> > >> >> private: > >> >> bufferlist monmap_bl; > >> >> + void _update_whoami(); > >> >> }; > >> >> > >> >> > >> >> -- > >> >> 1.6.4.2 > >> >> -- > >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> >> > >> >> > >> > > >> -- > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > >> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > >