On 2/19/20 10:11 AM, Paul Emmerich wrote:
> On Wed, Feb 19, 2020 at 10:03 AM Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 2/19/20 8:49 AM, Sean Matheny wrote:
>>> Thanks,
>>>
>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>
>>> How can I verify this? (i.e. the epoch of the monitor vs the epoch of the OSD(s))
>>>
>> Check the status of the OSDs:
>>
>> $ ceph daemon osd.X status
>>
>> This should tell you the newest map it has.
>>
>> Then check on the mons:
>>
>> $ ceph osd dump|head -n 10
>
> mons are offline

I think he said he got one MON back manually. His 'ceph -s' also shows it :-)

>
>> Or use ceph-monstore-tool to see what the latest map is that the MON has.
>
> ceph-monstore-tool <mon-dir> dump-keys
>
> Also useful:
>
> ceph-monstore-tool <mon-dir> get osdmap
>

Indeed. My thought is that there is a mismatch in OSDMaps between the MONs and OSDs which is causing these problems.

Wido

> Paul
>
>> Wido
>>
>>> Cheers,
>>> Sean
>>>
>>>> On 19/02/2020, at 7:25 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>
>>>> On 2/19/20 5:45 AM, Sean Matheny wrote:
>>>>> I wanted to add a specific question to the previous post, in the hopes it’s easier to answer.
>>>>>
>>>>> We have a Luminous monitor restored from the OSDs using ceph-objectstore-tool, which seems like the best chance of any success. We followed this rough process:
>>>>>
>>>>> https://tracker.ceph.com/issues/24419
>>>>>
>>>>> The monitor has come up (as a single-monitor cluster), but it’s reporting wildly inaccurate info, such as the number of OSDs that are down (157, but all 223 are down) and hosts down (1, but all 14 are down).
>>>>>
>>>> Have you verified that the MON's database has the same OSDMap epoch (or newer) as all the OSDs?
>>>>
>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>>
>>>>> The OSD daemons are still off, but I’m not sure if starting them back up with this monitor will make things worse. The fact that this mon daemon can’t even see how many OSDs are correctly down makes me think that nothing good will come from turning the OSDs back on.
>>>>>
>>>>> Do I run the risk of further corruption (i.e. on the Ceph side, not client data, as the cluster is paused) if I proceed and turn on the OSD daemons? Or is it worth a shot?
>>>>>
>>>>> Are these Ceph health metrics commonly inaccurate until the mon can talk to the daemons?
>>>>
>>>> The PG stats will indeed be inaccurate, and the number of OSDs can vary as long as they aren't able to peer with each other and the MONs.
>>>>
>>>>> (Also, other commands like `ceph osd tree` agree with the below `ceph -s` so far.)
>>>>>
>>>>> Many thanks for any wisdom… I just don’t want to make things (unnecessarily) much worse.
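A minimal sketch of the epoch comparison described above, assuming the default admin-socket paths under /var/run/ceph and an illustrative mon data directory (adjust both to the actual layout):

    # On each OSD host: the newest OSDMap epoch each local OSD daemon has seen
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "== $sock"
        ceph daemon "$sock" status | grep -E '"(oldest|newest)_map"'
    done

    # On the recovered mon: the OSDMap epoch the monitor is serving
    ceph osd dump | head -n 10      # the first line reports "epoch <N>"

    # Or, with the mon stopped, read it from the store as suggested above
    ceph-monstore-tool /var/lib/ceph/mon/ceph-ntr-mon01 get osdmap

If the OSDs report a newest_map higher than the epoch the mon serves, the mon's store is behind and booting the OSDs against it will not work.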
>>>>>
>>>>> Cheers,
>>>>> Sean
>>>>>
>>>>> root@ntr-mon01:/var/log/ceph# ceph -s
>>>>>   cluster:
>>>>>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>>>>>     health: HEALTH_WARN
>>>>>             pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub flag(s) set
>>>>>             157 osds down
>>>>>             1 host (15 osds) down
>>>>>             Reduced data availability: 12225 pgs inactive, 885 pgs down, 673 pgs peering
>>>>>             Degraded data redundancy: 14829054/35961087 objects degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized
>>>>>
>>>>>   services:
>>>>>     mon: 1 daemons, quorum ntr-mon01
>>>>>     mgr: ntr-mon01(active)
>>>>>     osd: 223 osds: 66 up, 223 in
>>>>>          flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
>>>>>
>>>>>   data:
>>>>>     pools:   14 pools, 15220 pgs
>>>>>     objects: 10.58M objects, 40.1TiB
>>>>>     usage:   43.0TiB used, 121TiB / 164TiB avail
>>>>>     pgs:     70.085% pgs unknown
>>>>>              10.237% pgs not active
>>>>>              14829054/35961087 objects degraded (41.236%)
>>>>>              10667 unknown
>>>>>              2869  active+undersized+degraded
>>>>>              885   down
>>>>>              673   peering
>>>>>              126   active+undersized
>>>>>
>>>>> On 19/02/2020, at 10:18 AM, Sean Matheny <s.matheny@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> Our entire cluster is down at the moment.
>>>>>
>>>>> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first monitor we upgraded crashed. We reverted to luminous on this one and tried another, and it was fine. We upgraded the rest, and they all worked.
>>>>>
>>>>> Then we upgraded the first one again, and after it became the leader, it died. Then the second one became the leader, and it died. Then the third became the leader, and it died, leaving mon 4 and 5 unable to form a quorum.
>>>>>
>>>>> We tried creating a single-monitor cluster by editing the monmap of mon05, and it died in the same way, just without the paxos negotiation first.
>>>>>
>>>>> We have tried to revert to a luminous (12.2.12) monitor backup taken a few hours before the crash. The mon daemon will start, but is flooded with blocked requests and unknown pgs after a while.
>>>>> For better or worse we removed the “noout” flag and 144 of 232 OSDs are now showing as down:
>>>>>
>>>>>   cluster:
>>>>>     id:     ababdd7f-1040-431b-962c-c45bea5424aa
>>>>>     health: HEALTH_ERR
>>>>>             noout,nobackfill,norecover flag(s) set
>>>>>             101 osds down
>>>>>             9 hosts (143 osds) down
>>>>>             1 auth entities have invalid capabilities
>>>>>             Long heartbeat ping times on back interface seen, longest is 15424.178 msec
>>>>>             Long heartbeat ping times on front interface seen, longest is 14763.145 msec
>>>>>             Reduced data availability: 521 pgs inactive, 48 pgs stale
>>>>>             274 slow requests are blocked > 32 sec
>>>>>             88 stuck requests are blocked > 4096 sec
>>>>>             1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has slow ops
>>>>>             too many PGs per OSD (299 > max 250)
>>>>>
>>>>>   services:
>>>>>     mon: 1 daemons, quorum ntr-mon01 (age 3m)
>>>>>     mgr: ntr-mon01(active, since 30m)
>>>>>     mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
>>>>>     osd: 223 osds: 66 up, 167 in
>>>>>          flags noout,nobackfill,norecover
>>>>>     rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)
>>>>>
>>>>>   data:
>>>>>     pools:   14 pools, 15220 pgs
>>>>>     objects: 35.26M objects, 134 TiB
>>>>>     usage:   379 TiB used, 1014 TiB / 1.4 PiB avail
>>>>>     pgs:     3.423% pgs unknown
>>>>>              14651 active+clean
>>>>>              521   unknown
>>>>>              48    stale+active+clean
>>>>>
>>>>>   io:
>>>>>     client: 20 KiB/s rd, 439 KiB/s wr, 7 op/s rd, 54 op/s wr
>>>>>
>>>>> These Luminous OSD daemons are shown as down, but are all in fact running; they just have no comms with the monitor:
>>>>>
>>>>> 2020-02-19 10:12:33.565680 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.37as3( v 129516'2 (0'0,129516'2] local-lis/les=297268/297269 n=0 ec=129502/129502 lis/c 297268/297268 les/c/f 297269/297358/0 297268/297268/161526) [41,192,216,0,160,117]p41(0) r=3 lpr=305101 crt=129516'2 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
>>>>> 2020-02-19 10:12:33.565861 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[4.53c( v 305046'1933429 (304777'1931907,305046'1933429] local-lis/les=298009/298010 n=7350 ec=768/768 lis/c 298009/298009 les/c/f 298010/298010/0 297268/298009/298009) [0,61,103] r=0 lpr=305101 crt=305046'1933429 lcod 0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
>>>>> 2020-02-19 10:12:33.566742 7ff222e24700  1 osd.0 pg_epoch: 305104 pg[100.des4( v 129516'1 (0'0,129516'1] local-lis/les=292010/292011 n=1 ec=129502/129502 lis/c 292010/292010 les/c/f 292011/292417/0 292010/292010/280955) [149,62,209,187,0,98]p149(0) r=4 lpr=305072 crt=129516'1 lcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
>>>>> 2020-02-19 10:12:33.566896 7ff23ccd9e00  0 osd.0 305104 done with init, starting boot process
>>>>> 2020-02-19 10:12:33.566956 7ff23ccd9e00  1 osd.0 305104 start_boot
>>>>>
>>>>> One oddity in our deployment is that there was a test mds instance, and it is running Mimic. I shut it down, as the monitor trace has an MDS call in it, but the Nautilus monitors still die the same way.
>>>>>
>>>>>     "mds": {
>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>>>>>     },
>>>>>
>>>>> ...
>>>>> -11> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804429 lease_expire=0.000000 has v0 lc 85449502
>>>>> -10> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804446 lease_expire=0.000000 has v0 lc 85449502
>>>>>  -9> 2020-02-18 09:50:00.800 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos recovering c 85448935..85449502) is_readable = 0 - now=2020-02-18 09:50:00.804460 lease_expire=0.000000 has v0 lc 85449502
>>>>>  -8> 2020-02-18 09:50:00.800 7fd164a1a700  4 set_mon_vals no callback set
>>>>>  -7> 2020-02-18 09:50:00.800 7fd164a1a700  4 mgrc handle_mgr_map Got map version 2301191
>>>>>  -6> 2020-02-18 09:50:00.804 7fd164a1a700  4 mgrc handle_mgr_map Active mgr is now v1:10.31.88.17:6801/2924412
>>>>>  -5> 2020-02-18 09:50:00.804 7fd164a1a700  0 log_channel(cluster) log [DBG] : monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>>>>>  -4> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client _send_to_mon log to self
>>>>>  -3> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
>>>>>  -2> 2020-02-18 09:50:00.804 7fd164a1a700 10 log_client will send 2020-02-18 09:50:00.806845 mon.ntr-mon02 (mon.1) 3 : cluster [DBG] monmap e25: 5 mons at {ntr-mon01=v1:10.31.88.14:6789/0,ntr-mon02=v1:10.31.88.15:6789/0,ntr-mon03=v1:10.31.88.16:6789/0,ntr-mon04=v1:10.31.88.17:6789/0,ntr-mon05=v1:10.31.88.18:6789/0}
>>>>>  -1> 2020-02-18 09:50:00.804 7fd164a1a700  5 mon.ntr-mon02@1(leader).paxos(paxos active c 85448935..85449502) is_readable = 1 - now=2020-02-18 09:50:00.806920 lease_expire=2020-02-18 09:50:05.804479 has v0 lc 85449502
>>>>>   0> 2020-02-18 09:50:00.812 7fd164a1a700 -1 *** Caught signal (Aborted) **
>>>>>  in thread 7fd164a1a700 thread_name:ms_dispatch
>>>>>
>>>>>  ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>>>>>  1: (()+0x11390) [0x7fd171e98390]
>>>>>  2: (gsignal()+0x38) [0x7fd1715e5428]
>>>>>  3: (abort()+0x16a) [0x7fd1715e702a]
>>>>>  4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fd173673bf5]
>>>>>  5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fd173667bd6]
>>>>>  6: (()+0x8b6c21) [0x7fd173667c21]
>>>>>  7: (()+0x8c2e34) [0x7fd173673e34]
>>>>>  8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fd17367f55f]
>>>>>  9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x79ae00]
>>>>>  10: (MDSMonitor::tick()+0xc9) [0x79c669]
>>>>>  11: (MDSMonitor::on_active()+0x28) [0x785e88]
>>>>>  12: (PaxosService::_active()+0xdd) [0x6d4b2d]
>>>>>  13: (Context::complete(int)+0x9) [0x600789]
>>>>>  14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6299a8]
>>>>>  15: (Paxos::finish_round()+0x76) [0x6cb276]
>>>>>  16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x6cc47f]
>>>>>  17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x6ccf2b]
>>>>>  18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x5fa6f5]
>>>>>  19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x5fad42]
>>>>>  20: (Monitor::ms_dispatch(Message*)+0x26) [0x62b046]
>>>>>  21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6270b6]
>>>>>  22: (DispatchQueue::entry()+0x1219) [0x7fd1732b7e59]
>>>>>  23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7fd17336836d]
>>>>>  24: (()+0x76ba) [0x7fd171e8e6ba]
>>>>>  25: (clone()+0x6d) [0x7fd1716b741d]
>>>>> ...
>>>>>
>>>>> Ceph versions output:
>>>>>
>>>>> {
>>>>>     "mon": {
>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 4
>>>>>     },
>>>>>     "mgr": {
>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 1,
>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 1,
>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 2
>>>>>     },
>>>>>     "osd": {
>>>>>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 32,
>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 16
>>>>>     },
>>>>>     "mds": {
>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1
>>>>>     },
>>>>>     "rgw": {
>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 2
>>>>>     },
>>>>>     "overall": {
>>>>>         "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 175,
>>>>>         "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 35,
>>>>>         "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 18,
>>>>>         "ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)": 1,
>>>>>         "ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)": 6
>>>>>     }
>>>>> }
>>>>>
>>>>> We’ve filed a bug report covering the actual cascading crash described above (when we upgraded mon01 and it became the leader): https://tracker.ceph.com/issues/44185 (parts here copied from that report)
>>>>>
>>>>> Right now we’re not sure what the best path to some sort of recovery would be. All OSD daemons are still on Luminous, so AFAICT we could rebuild the monitor DB from the OSDs following https://github.com/ceph/ceph/blob/luminous/doc/rados/troubleshooting/troubleshooting-mon.rst#recovery-using-osds, which describes using this script:
>>>>>
>>>>> #!/bin/bash
>>>>> hosts="ntr-sto01 ntr-sto02"
>>>>> ms=/tmp/mon-store/
>>>>> mkdir $ms
>>>>> # collect the cluster map from OSDs
>>>>> for host in $hosts; do
>>>>>     echo $host
>>>>>     rsync -avz $ms root@$host:$ms
>>>>>     rm -rf $ms
>>>>>     ssh root@$host <<EOF
>>>>> for osd in /var/lib/ceph/osd/ceph-*; do
>>>>>   ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms
>>>>> done
>>>>> EOF
>>>>>     rsync -avz root@$host:$ms $ms
>>>>> done
>>>>>
>>>>> If this is our best idea to try, should we try the mon store from the above script on a Luminous or Nautilus mon daemon? Any other ideas to try at this dark hour? :\
>>>>>
>>>>> Cheers,
>>>>> Sean

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
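For completeness, the "Recovery using OSDs" procedure linked above continues past the collection script with a store rebuild step. A minimal sketch of that step, with the mon id and keyring path left as placeholders to adapt (see the linked Luminous troubleshooting guide for the authoritative version):

    ms=/tmp/mon-store
    # Rebuild a monitor store from the maps collected by the script quoted above.
    # The keyring should hold admin credentials; the guide spells out exactly
    # what it must contain.
    ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring

    # Back up the broken store, move the rebuilt one into place, fix ownership.
    mv /var/lib/ceph/mon/ceph-<mon-id>/store.db /var/lib/ceph/mon/ceph-<mon-id>/store.db.corrupted
    mv $ms/store.db /var/lib/ceph/mon/ceph-<mon-id>/store.db
    chown -R ceph:ceph /var/lib/ceph/mon/ceph-<mon-id>/store.db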