Thanks for all of the helpful suggestions. We're back up and running.

We successfully re-created the monitor and re-imported the keys. With croit's help, we turned the OSD daemons back on, and things came back relatively smoothly (a few inactive/incomplete pgs and a few expected small things). It was confirmed that when all OSDs are down, the monitor most likely won't show accurate information on OSD up/down status; this huge discrepancy was one of the things that gave me pause about bringing the OSDs back up.

The root cause, according to them, was our having a mimic mds in the cluster (everything else was luminous) while upgrading to nautilus. This was a fairly sloppy mistake on our part, of course. Lessons learned. Fortunately we only had CephFS running as a proof of concept; it wasn't in use, and could simply be removed (rather than rebuilt).

Incidentally, there are a lot of different versions of the script around for rebuilding a mon from the OSDs, but they all had their issues and errors (for us). Here's what worked for us (luminous/bluestore), in case it helps others:

#!/bin/bash
ms=/root/mon-store
db=/root/db
db_slow=/root/db.slow
hosts="host1 host2 host3"

# local staging directories for the accumulated mon store
mkdir -p $ms $db $db_slow

# collect the cluster map from the stopped OSDs, one host at a time
for host in $hosts; do
    ssh root@$host mkdir -p $ms
    ssh root@$host mkdir -p $db
    ssh root@$host mkdir -p $db_slow
    # push what has been collected so far out to this host
    rsync -avz $ms/. root@$host:$ms
    rsync -avz $db/. root@$host:$db
    rsync -avz $db_slow/. root@$host:$db_slow
    rm -rf $ms
    # add this host's OSD maps to the store
    ssh root@$host <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms
done
EOF
    # pull the updated store back
    rsync -avz root@$host:$ms/. $ms
    rsync -avz root@$host:$db/. $db
    rsync -avz root@$host:$db_slow/. $db_slow
done
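For completeness: the loop above only collects the cluster map from the OSDs, so the mon store still needs to be rebuilt and dropped into the monitor's data directory afterwards. The rebuild step from the troubleshooting-mon doc (linked further down in this thread) looks roughly like the sketch below. The keyring path and mon directory are placeholders rather than our actual values, the keyring has to already contain the mon. and client.admin keys, and the keyring parts can be skipped entirely if you don't use cephx:

# give the keyring the caps the rebuilt store expects
ceph-authtool /root/admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool /root/admin.keyring -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

# rebuild the mon store from the collected maps
ceph-monstore-tool /root/mon-store rebuild -- --keyring /root/admin.keyring

# back up the old store.db and copy the rebuilt one into place on the mon host
mv /var/lib/ceph/mon/ceph-ntr-mon01/store.db /var/lib/ceph/mon/ceph-ntr-mon01/store.db.corrupted
cp -r /root/mon-store/store.db /var/lib/ceph/mon/ceph-ntr-mon01/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-ntr-mon01/store.db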
Cheers,
Sean


> On 19/02/2020, at 11:42 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 2/19/20 10:11 AM, Paul Emmerich wrote:
>> On Wed, Feb 19, 2020 at 10:03 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>
>>> On 2/19/20 8:49 AM, Sean Matheny wrote:
>>>> Thanks,
>>>>
>>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>>
>>>> How can I verify this? (i.e. the epoch of the monitor vs the epoch of the osd(s))
>>>>
>>> Check the status of the OSDs:
>>>
>>> $ ceph daemon osd.X status
>>>
>>> This should tell the newest map it has.
>>>
>>> Then check on the mons:
>>>
>>> $ ceph osd dump | head -n 10
>>
>> mons are offline
>
> I think he said he got one MON back manually. His 'ceph -s' also shows it :-)
>
>>> Or use ceph-monstore-tool to see what the latest map is that the MON has.
>>
>> ceph-monstore-tool <mon-dir> dump-keys
>>
>> Also useful:
>>
>> ceph-monstore-tool <mon-dir> get osdmap
>>
> Indeed. My thought is that there is a mismatch in OSDMaps between the MONs and OSDs which is causing these problems.
>
> Wido
>
>> Paul
>>
>>> Wido
>>>
>>>> Cheers,
>>>> Sean
>>>>
>>>>> On 19/02/2020, at 7:25 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>
>>>>> On 2/19/20 5:45 AM, Sean Matheny wrote:
>>>>>> I wanted to add a specific question to the previous post, in the hopes it's easier to answer.
>>>>>>
>>>>>> We have a Luminous monitor restored from the OSDs using ceph-objectstore-tool, which seems like the best chance of any success. We followed this rough process:
>>>>>>
>>>>>> https://tracker.ceph.com/issues/24419
>>>>>>
>>>>>> The monitor has come up (as a single monitor cluster), but it's reporting wildly inaccurate info, such as the number of osds that are down (157, but all 223 are down), and hosts (1, but all 14 are down).
>>>>>>
>>>>> Have you verified that the MON's database has the same epoch of the OSDMap (or newer) as all the other OSDs?
>>>>>
>>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>>>
>>>>>> The OSD daemons are still off, but I'm not sure if starting them back up with this monitor will make things worse. The fact that this mon daemon can't even see how many OSDs are correctly down makes me think that nothing good will come from turning the OSDs back on.
>>>>>>
>>>>>> Do I run the risk of further corruption (i.e. on the Ceph side, not client data, as the cluster is paused) if I proceed and turn on the osd daemons? Or is it worth a shot?
>>>>>>
>>>>>> Are these Ceph health metrics commonly inaccurate until the mon can talk to the daemons?
>>>>>
>>>>> The PG stats will be inaccurate indeed, and the number of OSDs can vary as long as they aren't able to peer with each other and the MONs.
>>>>>
>>>>>> (Also, other commands like `ceph osd tree` agree with the below `ceph -s` so far.)
>>>>>>
>>>>>> Many thanks for any wisdom… I just don't want to make things (unnecessarily) much worse.
>>>>>>
>>>>>> Cheers,
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>> root@ntr-mon01:/var/log/ceph# ceph -s
>>>>>>   cluster:
>>>>>>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>>>>>>     health: HEALTH_WARN
>>>>>>             pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>>             157 osds down
>>>>>>             1 host (15 osds) down
>>>>>>             Reduced data availability: 12225 pgs inactive, 885 pgs down, 673 pgs peering
>>>>>>             Degraded data redundancy: 14829054/35961087 objects degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized
>>>>>>
>>>>>>   services:
>>>>>>     mon: 1 daemons, quorum ntr-mon01
>>>>>>     mgr: ntr-mon01(active)
>>>>>>     osd: 223 osds: 66 up, 223 in
>>>>>>          flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
>>>>>>
>>>>>>   data:
>>>>>>     pools:   14 pools, 15220 pgs
>>>>>>     objects: 10.58M objects, 40.1TiB
>>>>>>     usage:   43.0TiB used, 121TiB / 164TiB avail
>>>>>>     pgs:     70.085% pgs unknown
>>>>>>              10.237% pgs not active
>>>>>>              14829054/35961087 objects degraded (41.236%)
>>>>>>              10667 unknown
>>>>>>              2869  active+undersized+degraded
>>>>>>              885   down
>>>>>>              673   peering
>>>>>>              126   active+undersized
>>>>>>
>>>>>>
>>>>>> On 19/02/2020, at 10:18 AM, Sean Matheny <s.matheny@xxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Our entire cluster is down at the moment.
>>>>>>
>>>>>> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first monitor we upgraded crashed. We reverted to luminous on this one and tried another, and it was fine. We upgraded the rest, and they all worked.
>>>>>>
>>>>>> Then we upgraded the first one again, and after it became the leader, it died. Then the second one became the leader, and it died. Then the third became the leader, and it died, leaving mon 4 and 5 unable to form a quorum.
>>>>>>
>>>>>> We tried creating a single monitor cluster by editing the monmap of mon05, and it died in the same way, just without the paxos negotiation first.
>>>>>>
>>>>>> We have tried to revert to a luminous (12.2.12) monitor backup taken a few hours before the crash. The mon daemon will start, but is flooded with blocked requests and unknown pgs after a while. For better or worse we removed the "noout" flag, and 144 of 232 OSDs are now showing as down:
>>>>>>
>>>>>>   cluster:
>>>>>>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>>>>>>     health: HEALTH_ERR
>>>>>>             noout,nobackfill,norecover flag(s) set
>>>>>>             101 osds down
>>>>>>             9 hosts (143 osds) down
>>>>>>             1 auth entities have invalid capabilities
>>>>>>             Long heartbeat ping times on back interface seen, longest is 15424.178 msec
>>>>>>             Long heartbeat ping times on front interface seen, longest is 14763.145 msec
>>>>>>             Reduced data availability: 521 pgs inactive, 48 pgs stale
>>>>>>             274 slow requests are blocked > 32 sec
>>>>>>             88 stuck requests are blocked > 4096 sec
>>>>>>             1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
>>>>>>             too many PGs per OSD (299 > max 250)
>>>>>>
>>>>>>   services:
>>>>>>     mon: 1 daemons, quorum ntr-mon01 (age 3m)
>>>>>>     mgr: ntr-mon01(active, since 30m)
>>>>>>     mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
>>>>>>     osd: 223 osds: 66 up, 167 in
>>>>>>          flags noout,nobackfill,norecover
>>>>>>     rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)
>>>>>>
>>>>>>   data:
>>>>>>     pools:   14 pools, 15220 pgs
>>>>>>     objects: 35.26M objects, 134 TiB
>>>>>>     usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
>>>>>>     pgs:     3.423% pgs unknown
>>>>>>              14651 active+clean
>>>>>>              521   unknown
>>>>>>              48    stale+active+clean
>>>>>>
>>>>>>   io:
>>>>>>     client: 20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
>>>>>>
>>>>>> These luminous OSD daemons are not down, but are all in fact running. They just have no comms with the monitor:
>>>>>>
>>>>>> 2020-02-19 10:12:33.565680 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.37as3( v 129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502 lis/c 297268/297268 les/c/f 297269/297358/0 297268/297268/161526) [41,192,216,0,160,117]p41(0) r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
>>>>>> 2020-02-19 10:12:33.565861 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[4.53c( v 305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010 n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0 297268/298009/298009) [0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
>>>>>> 2020-02-19 10:12:33.566742 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.des4( v 129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502 lis/c 292010/292010 les/c/f 292011/292417/0 292010/292010/280955) [149,62,209,187,0,98]p149(0) r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
>>>>>> 2020-02-19 10:12:33.566896 7ff23ccd9e00  0 osd.0 305104 done with init, starting boot process
>>>>>> 2020-02-19 10:12:33.566956 7ff23ccd9e00  1 osd.0 305104 start_boot
>>>>>>
>>>>>> One oddity in our deployment is that there was a test mds instance, and it is running mimic. I shut it down, as the monitor trace has an MDS call in it, but the nautilus monitors still die the same way.
>>>>>>
>>>>>>     "mds": {
>>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>>>>>>     },
>>>>>>
>>>>>> ...
>>>>>>    -11> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0 lc 85449502
>>>>>>    -10> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0 lc 85449502
>>>>>>     -9> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0 lc 85449502
>>>>>>     -8> 2020-02-18 09:50:00.800 7fd164a1a700  4 set_mon_vals no callback set
>>>>>>     -7> 2020-02-18 09:50:00.800 7fd164a1a700  4 mgrc handle_mgr_map Got map version 2301191
>>>>>>     -6> 2020-02-18 09:50:00.804 7fd164a1a700  4 mgrc handle_mgr_map Active mgr is now v1:10.31.88.17:6801/2924412
>>>>>>     -5> 2020-02-18 09:50:00.804 7fd164a1a700  0 log_channel(cluster) log [DBG] : monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>>>>>>     -4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to self
>>>>>>     -3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
>>>>>>     -2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client will send 2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>>>>>>     -1> 2020-02-18 09:50:00.804 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
>>>>>>      0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
>>>>>>  in thread 7fd164a1a700 thread_name:ms_dispatch
>>>>>>
>>>>>>  ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>>>>>>  1: (()+0x11390) [0x7fd171e98390]
>>>>>>  2: (gsignal()+0x38) [0x7fd1715e5428]
>>>>>>  3: (abort()+0x16a) [0x7fd1715e702a]
>>>>>>  4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
>>>>>>  5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
>>>>>>  6: (()+0x8b6c21) [0x7fd173667c21]
>>>>>>  7: (()+0x8c2e34) [0x7fd173673e34]
>>>>>>  8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
>>>>>>  9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
>>>>>>  10: (MDSMonitor::tick()+0xc9) [0x79c669]
>>>>>>  11: (MDSMonitor::on_active()+0x28) [0x785e88]
>>>>>>  12: (PaxosService::_active()+0xdd) [0x6d4b2d]
>>>>>>  13: (Context::complete(int)+0x9) [0x600789]
>>>>>>  14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
>>>>>>  15: (Paxos::finish_round()+0x76) [0x6cb276]
>>>>>>  16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
>>>>>>  17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
>>>>>>  18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x5fa6f5]
>>>>>>  19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
>>>>>>  20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
>>>>>>  21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6270b6]
>>>>>>  22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
>>>>>>  23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
>>>>>>  24: (()+0x76ba) [0x7fd171e8e6ba]
>>>>>>  25: (clone()+0x6d) [0x7fd1716b741d]
>>>>>> ...
>>>>>>
>>>>>> Ceph versions output
>>>>>>
>>>>>> {
>>>>>>     "mon": {
>>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 4
>>>>>>     },
>>>>>>     "mgr": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 1,
>>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 2
>>>>>>     },
>>>>>>     "osd": {
>>>>>>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 32,
>>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 16
>>>>>>     },
>>>>>>     "mds": {
>>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>>>>>>     },
>>>>>>     "rgw": {
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
>>>>>>     },
>>>>>>     "overall": {
>>>>>>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 35,
>>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 18,
>>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1,
>>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 6
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> We've filed a bug report with the actions of the actual cascading crash described above (when we upgraded mon01 and it became the leader): https://tracker.ceph.com/issues/44185 (parts here copied from that report)
>>>>>>
>>>>>> Right now we're not sure what the best path to some sort of recovery would be. All OSD daemons are still on Luminous, so AFAICT, we could build the monitor db from the OSDs with
>>>>>> https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds
>>>>>> which describes using this script:
>>>>>>
>>>>>> #!/bin/bash
>>>>>> hosts="ntr-sto01 ntr-sto02"
>>>>>> ms=/tmp/mon-store/
>>>>>> mkdir $ms
>>>>>> # collect the cluster map from OSDs
>>>>>> for host in $hosts; do
>>>>>>   echo $host
>>>>>>   rsync -avz $ms root@$host:$ms
>>>>>>   rm -rf $ms
>>>>>>   ssh root@$host <<EOF
>>>>>>     for osd in /var/lib/ceph/osd/ceph-*; do
>>>>>>       ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
>>>>>>     done
>>>>>> EOF
>>>>>>   rsync -avz root@$host:$ms $ms
>>>>>> done
>>>>>>
>>>>>> If this is our best idea to try, should we try the mon store from the above script on a luminous or nautilus mon daemon?
>>>>>> Any other ideas to try at this dark hour? :\
>>>>>>
>>>>>> Cheers,
>>>>>> Sean
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx