Hi, does anybody get these error messages in the MGR log?

2019-11-06 15:41:44.765 7f10db740700  0 auth: could not find secret_id=3863
2019-11-06 15:41:44.765 7f10db740700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=3863

THX

On 06.11.2019 at 10:43, Oliver Freyermuth wrote:
> Hi all,
>
> interestingly, now that the third mon has been missing for almost a week
> (those planned interventions always take longer than expected...),
> we get mgr failovers (but without crashes).
>
> In the mgr log, I find:
>
> 2019-11-06 07:50:05.409 7fce8a0dc700  0 client.0 ms_handle_reset on v2:10.160.16.1:6800/618072
> ...
> ... the normal churning ...
> ...
> 2019-11-06 07:52:44.113 7fce8a0dc700 -1 mgr handle_mgr_map I was active but no longer am
> 2019-11-06 07:52:44.113 7fce8a0dc700  1 mgr respawn e: '/usr/bin/ceph-mgr'
>
> In the mon log, I see:
> ...
> 2019-11-06 07:44:11.565 7f1f44453700  4 rocksdb: [db/db_impl_files.cc:356] [JOB 225] Try to delete WAL files size 10830909, prev total WAL file size 10839895, number of live WAL files 2.
>
> 2019-11-06 07:44:11.565 7f1f3a43f700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
> 2019-11-06 07:44:11.565 7f1f44c54700  4 rocksdb: (Original Log Time 2019/11/06-07:44:11.565802) [db/db_impl_compaction_flush.cc:2374] [default] Manual compaction from level-0 to level-6 from 'mgrstat .. 'mgrstat; will stop at (end)
> ...
> 2019-11-06 07:50:36.734 7f1f3a43f700  4 rocksdb: [db/db_impl_compaction_flush.cc:1403] [default] Manual compaction starting
> 2019-11-06 07:52:27.046 7f1f4144d700  0 log_channel(cluster) log [INF] : Manager daemon mon001 is unresponsive, replacing it with standby daemon mon002
> ...
>
> There's a lot of compaction going on (probably due to the prolonged HEALTH_WARN state, so not really unexpected),
> so I wonder whether the actual cause for flagging the mgr as "unresponsive" is the heavy compaction on the mons.
> It will be interesting to see what happens when we finally have the third mon back and the cluster becomes healthy again...
>
> Has somebody seen something similar after running for a week or more with Nautilus on old and slow hardware?
>
> Cheers,
> Oliver
>
> On 02.11.19 at 18:20, Oliver Freyermuth wrote:
>> Dear Sage,
>>
>> good news - it happened again, with debug logs!
>> There's nothing obvious to my eye; it's uploaded as:
>>   0b2d0c09-46f3-4126-aa27-e2d2e8572741
>> It seems the failure was roughly in parallel to my trying to access the dashboard.
>> It must have happened within the last ~5-10 minutes of the log.
>>
>> I'll now go back to "stable operation". In case you need anything else, just let me know.
>>
>> Cheers and all the best,
>> Oliver
>>
>> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
>>> Dear Sage,
>>>
>>> at least for the simple case:
>>>   ceph device get-health-metrics osd.11
>>> => mgr crashes (but in that case, it crashes fully, i.e. the process is gone).
>>> I have now uploaded a verbose log as:
>>>   ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
>>>
>>> One potential cause of this (and maybe of the other issues) might be that some of our OSDs
>>> are on non-JBOD controllers and hence are each built as a single-disk RAID 0,
>>> so a simple smartctl on the device will not work (-d megaraid,<number> would be needed).
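>>>
>>> For reference, querying such a disk by hand should look roughly like this (device path
>>> and disk number are just examples; the megaraid number enumerates the physical drives
>>> behind the controller):
>>>
>>>   smartctl -a -d megaraid,0 /dev/sda
>>>
>>> so the devicehealth module probably gets nothing useful back from these OSDs as it stands.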
>>>
>>> Now I have both mgrs active again, debug logging on and device health metrics on again,
>>> and am waiting for them to go silent again. Let's hope the issue reappears before the disks fill up with logs ;-).
>>>
>>> Cheers,
>>> Oliver
>>>
>>> On 02.11.19 at 02:56, Sage Weil wrote:
>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>>>> Dear Cephers,
>>>>>
>>>>> interestingly, after:
>>>>>   ceph device monitoring off
>>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later,
>>>>> but the standby took over and stayed stable, and after restarting the broken one, it has
>>>>> now been stable for an hour, too. So probably a restart of the mgr is needed after
>>>>> disabling device monitoring to get things stable again.
>>>>>
>>>>> So it seems to be caused by a problem with the device health metrics.
>>>>> In case this is a red herring and the mgrs become unstable again in the next days,
>>>>> I'll let you know.
>>>>
>>>> If this seems to stabilize things, and you can tolerate inducing the
>>>> failure again, reproducing the problem with mgr logs cranked up (debug_mgr
>>>> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is
>>>> hanging. Let us know!
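>>>>
>>>> One way to do that at runtime would be via the config database (this should take
>>>> effect without a restart):
>>>>
>>>>   ceph config set mgr debug_mgr 20
>>>>   ceph config set mgr debug_ms 1
>>>>
>>>> and 'ceph config rm mgr debug_mgr' / 'ceph config rm mgr debug_ms' to drop back
>>>> to the defaults afterwards.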
>>>>
>>>> Thanks,
>>>> sage
>>>>
>>>>> Cheers,
>>>>> Oliver
>>>>>
>>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
>>>>>> Dear Cephers,
>>>>>>
>>>>>> this is a 14.2.4 cluster with device health metrics enabled - since about a day,
>>>>>> all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
>>>>>>
>>>>>>   cluster:
>>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>>>     health: HEALTH_WARN
>>>>>>             no active mgr
>>>>>>             1/3 mons down, quorum mon001,mon002
>>>>>>
>>>>>>   services:
>>>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>>>     mgr: no daemons active (since 56m)
>>>>>>     ...
>>>>>>
>>>>>> (the third mon has a planned outage and will come back in a few days)
>>>>>>
>>>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time
>>>>>> when they go "silent", first on the first mgr:
>>>>>>
>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>
>>>>>> and a bit later, on the standby mgr:
>>>>>>
>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>
>>>>>> Interestingly, the dashboard still works, but presents outdated information,
>>>>>> showing for example zero I/O going on.
>>>>>> I believe this started to happen mainly after the third mon went into its known
>>>>>> downtime, but I am not fully sure this was the trigger, since the cluster is
>>>>>> still growing. It may also have been the addition of 24 more OSDs.
>>>>>>
>>>>>> I also find other messages in the mgr logs which seem problematic, but I am not
>>>>>> sure they are related:
>>>>>> ------------------------------
>>>>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>>>     ioctx.operate_read_op(op, devid)
>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>>>>> ------------------------------
>>>>>> or:
>>>>>> ------------------------------
>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>>>> ------------------------------
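>>>>>>
>>>>>> (The empty parentheses presumably mean these OSDs return no JSON at all. If I read
>>>>>> the devicehealth module correctly, a daemon can also be asked for its SMART data
>>>>>> directly, along the lines of:
>>>>>>
>>>>>>   ceph device query-daemon-health-metrics osd.51
>>>>>>
>>>>>> which should show whether anything usable comes back from the disks behind the
>>>>>> RAID 0 controllers.)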
>>>>>>
>>>>>> The reason why I am cautious about the health metrics is that I observed a crash
>>>>>> when trying to query them:
>>>>>> ------------------------------
>>>>>> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>>>> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
>>>>>>
>>>>>>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>>>>  1: (()+0xf5f0) [0x7fa488cee5f0]
>>>>>>  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>>>>  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>>>>  7: (()+0x709c8) [0x7fa48ae479c8]
>>>>>>  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>  9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>>>>  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>  11: (()+0x4bb95) [0x7fa48ae22b95]
>>>>>>  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>>>>  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>>>>  14: (()+0x16c23d) [0x55c3c204023d]
>>>>>>  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>>>>  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>>>>  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>>>>  18: (()+0x7e65) [0x7fa488ce6e65]
>>>>>>  19: (clone()+0x6d) [0x7fa48799488d]
>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>> ------------------------------
>>>>>>
>>>>>> I have issued:
>>>>>>   ceph device monitoring off
>>>>>> for now and will keep watching to see if the mgrs go silent again. If there are
>>>>>> any better ideas or this issue is known, let me know.
>>>>>>
>>>>>> Cheers,
>>>>>> Oliver

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx