Well, even after restarting the MGR service, the relevant log is flooded with these error messages:

2019-11-06 17:46:22.363 7f81ffdcc700 0 auth: could not find secret_id=3865
2019-11-06 17:46:22.363 7f81ffdcc700 0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3865

As you can see, the secret_id changes. However, I have no idea which service this secret_id belongs to.
And in my opinion these errors are preventing the MGR from doing its job: bringing the cluster back to a healthy state.

On 06.11.2019 at 17:41, Mac Wynkoop wrote:
> I actually just had some unresponsive mgr daemons; restarting them fixed the issue.
> If it happens again, I'll check whether it's the same error.
> Mac Wynkoop
>
> On Wed, Nov 6, 2019 at 8:43 AM Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>
> Hi,
>
> does anybody get these error messages in the MGR log?
> 2019-11-06 15:41:44.765 7f10db740700 0 auth: could not find secret_id=3863
> 2019-11-06 15:41:44.765 7f10db740700 0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3863
>
> THX
>
> On 06.11.2019 at 10:43, Oliver Freyermuth wrote:
> > Hi together,
> >
> > interestingly, now that the third mon is missing for almost a week
> > (those planned interventions always take longer than expected...),
> > we get mgr failovers (but without crashes).
> >
> > In the mgr log, I find:
> >
> > 2019-11-06 07:50:05.409 7fce8a0dc700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/618072
> > ...
> > ... the normal churning ...
> > ...
> > 2019-11-06 07:52:44.113 7fce8a0dc700 -1 mgr handle_mgr_map I was active but no longer am
> > 2019-11-06 07:52:44.113 7fce8a0dc700 1 mgr respawn e: '/usr/bin/ceph-mgr'
> >
> > In the mon log, I see:
> > ...
> > 2019-11-06 07:44:11.565 7f1f44453700 4 rocksdb: [db/db_impl_files.cc:356] [JOB 225] Try to delete WAL files size 10830909, prev total WAL file size 10839895, number of live WAL files 2.
> >
> > 2019-11-06 07:44:11.565 7f1f3a43f700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
> > 2019-11-06 07:44:11.565 7f1f44c54700 4 rocksdb: (Original Log Time 2019/11/06-07:44:11.565802) [db/db_impl_compaction_flush.cc:2374] [default] Manual compaction from level-0 to level-6 from 'mgrstat .. 'mgrstat; will stop at (end)
> > ...
> > 2019-11-06 07:50:36.734 7f1f3a43f700 4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
> > 2019-11-06 07:52:27.046 7f1f4144d700 0 log_channel(cluster) log [INF] : Manager daemon mon001 is unresponsive, replacing it with standby daemon mon002
> > ...
> >
> > There's a lot of compaction going on (probably due to the prolonged HEALTH_WARN state, so not really unexpected),
> > so I wonder whether the actual cause for identifying the mgr as "unresponsive" is the heavy compaction on the mons.
> > It will be interesting to see what happens when we finally have the third mon back and the cluster becomes healthy again...
> >
> > Did somebody see something similar after running for a week or more with Nautilus on old and slow hardware?
> >
> > Cheers,
> > Oliver
> >
> > On 02.11.19 at 18:20, Oliver Freyermuth wrote:
> >> Dear Sage,
> >>
> >> good news - it happened again, with debug logs!
> >> There's nothing obvious to my eye; it's uploaded as:
> >> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> >> It seems the failure happened roughly in parallel with my attempt to access the dashboard.
> >> It must have happened within the last ~5-10 minutes of the log.
> >>
> >> I'll now go back to "stable operation"; in case you need anything else, just let me know.
> >>
> >> Cheers and all the best,
> >> Oliver
> >>
> >> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
> >>> Dear Sage,
> >>>
> >>> at least for the simple case:
> >>> ceph device get-health-metrics osd.11
> >>> => the mgr crashes (but in that case, it crashes fully, i.e. the process is gone).
> >>> I have now uploaded a verbose log as:
> >>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> >>>
> >>> One potential cause of this (and maybe of the other issues) might be that some of our OSDs
> >>> are on non-JBOD controllers and hence are set up as one RAID 0 per disk,
> >>> so a simple smartctl on the device will not work (-d megaraid,<number> would be needed).
> >>>
> >>> Now I have both mgrs active again, debug logging on, and device health metrics on again,
> >>> and am waiting for them to become silent again. Let's hope the issue reappears before the disks run full of logs ;-).
> >>>
> >>> Cheers,
> >>> Oliver
> >>>
> >>> On 02.11.19 at 02:56, Sage Weil wrote:
> >>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>>> Dear Cephers,
> >>>>>
> >>>>> interestingly, after:
> >>>>> ceph device monitoring off
> >>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later,
> >>>>> but the standby took over and was stable, and after restarting the broken one, it has now been stable for an hour, too,
> >>>>> so probably a restart of the mgr is needed after disabling device monitoring to get things stable again.
> >>>>>
> >>>>> So it seems to be caused by a problem with the device health metrics.
> >>>>> In case this is a red herring and the mgrs become unstable again in the next few days, I'll let you know.
> >>>>
> >>>> If this seems to stabilize things, and you can tolerate inducing the
> >>>> failure again, reproducing the problem with mgr logs cranked up (debug_mgr
> >>>> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is
> >>>> hanging. Let us know!
> >>>>
> >>>> Thanks,
> >>>> sage
> >>>>
> >>>>
> >>>>> Cheers,
> >>>>> Oliver
> >>>>>
> >>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> >>>>>> Dear Cephers,
> >>>>>>
> >>>>>> this is a 14.2.4 cluster with device health metrics enabled -
> >>>>>> since about a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
> >>>>>>
> >>>>>>   cluster:
> >>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
> >>>>>>     health: HEALTH_WARN
> >>>>>>             no active mgr
> >>>>>>             1/3 mons down, quorum mon001,mon002
> >>>>>>
> >>>>>>   services:
> >>>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
> >>>>>>     mgr: no daemons active (since 56m)
> >>>>>>     ...
> >>>>>> (the third mon has a planned outage and will come back in a few days)
> >>>>>>
> >>>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time when it goes "silent", first for the first mgr:
> >>>>>>
> >>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
> >>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>>
> >>>>>> and a bit later, on the standby mgr:
> >>>>>>
> >>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
> >>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
> >>>>>>
> >>>>>> Interestingly, the dashboard still works, but presents outdated information and, for example, shows zero I/O going on.
> >>>>>> I believe this started to happen mainly after the third mon went into the known downtime,
> >>>>>> but I am not fully sure if this was the trigger, since the cluster is still growing.
> >>>>>> It may also have been the addition of 24 more OSDs.
> >>>>>>
> >>>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
> >>>>>> Traceback (most recent call last):
> >>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
> >>>>>>     ioctx.operate_read_op(op, devid)
> >>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> >>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
> >>>>>> ------------------------------
> >>>>>> or:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
> >>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
> >>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
> >>>>>> ------------------------------
> >>>>>> 2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> >>>>>> 2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
> >>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
> >>>>>> in thread 7fa46394b700 thread_name:mgr-fin
> >>>>>>
> >>>>>> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> >>>>>> 1: (()+0xf5f0) [0x7fa488cee5f0]
> >>>>>> 2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >>>>>> 3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>> 6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >>>>>> 7: (()+0x709c8) [0x7fa48ae479c8]
> >>>>>> 8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>> 9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >>>>>> 10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>> 11: (()+0x4bb95) [0x7fa48ae22b95]
> >>>>>> 12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >>>>>> 13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >>>>>> 14: (()+0x16c23d) [0x55c3c204023d]
> >>>>>> 15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >>>>>> 16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >>>>>> 17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> >>>>>> 18: (()+0x7e65) [0x7fa488ce6e65]
> >>>>>> 19: (clone()+0x6d) [0x7fa48799488d]
> >>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> I have issued:
> >>>>>> ceph device monitoring off
> >>>>>> for now and will keep waiting to see if mgrs go silent again.
> >>>>>> If there are any better ideas or this issue is known, let me know.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Oliver
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
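A minimal command sketch of the steps discussed in this thread (cranking up mgr logging as Sage suggested, disabling device health monitoring as a workaround, and checking SMART data behind a non-JBOD controller). It assumes a Nautilus cluster using the centralized config database; the mgr daemon id "mon001", the log file path, and the megaraid device number are placeholders, not values from the thread:

# Raise mgr log verbosity for reproducing the hang (debug_mgr = 20, debug_ms = 1)
ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1

# Workaround: stop the devicehealth module from scraping SMART data,
# then restart the mgr so the change takes full effect
ceph device monitoring off
systemctl restart ceph-mgr@mon001        # daemon id is a placeholder

# Reproduce, e.g. by querying health metrics for a single OSD
ceph device get-health-metrics osd.11

# Afterwards, revert the debug levels to their defaults and upload the log
ceph config rm mgr debug_mgr
ceph config rm mgr debug_ms
ceph-post-file /var/log/ceph/ceph-mgr.mon001.log    # path is a placeholder

# For OSDs behind a RAID-0-per-disk controller, smartctl needs the megaraid
# pass-through; the device number (here 0) is a placeholder
smartctl -a -d megaraid,0 /dev/sda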