Dear Janek,

in my case, the mgr daemon itself remains "running", it just stops reporting to the mon. It even still serves the dashboard, but with outdated information.
I grepped through the logs and could not find any clock skew messages. So it seems to be a different issue (albeit both issues seem to be triggered by the devicehealth module).

Cheers,
Oliver

On 2019-11-02 18:28, Janek Bevendorff wrote:
> These issues sound a bit like a bug I reported a few days ago: https://tracker.ceph.com/issues/39264 <https://tracker.ceph.com/issues/39264#change-149689>
>
> Related: https://tracker.ceph.com/issues/39264 <https://tracker.ceph.com/issues/39264#change-149689>
>
> On 02/11/2019 17:34, Oliver Freyermuth wrote:
>> Dear Reed,
>>
>> yes, the balancer is also on for me - but the instabilities vanished as soon as I turned off device health metrics.
>>
>> Cheers,
>> Oliver
>>
>> On 02.11.19 at 17:31, Reed Dier wrote:
>>> Do you also have the balancer module on?
>>>
>>> I experienced extremely bad stability issues where the MGRs would silently die with the balancer module on.
>>> And by "on", I mean `active: true` by way of `ceph balancer on`.
>>>
>>> Once I disabled the automatic balancer, it seemed to become much more stable.
>>>
>>> I can still run the balancer manually without issues (except for one pool), but the automatic balancer appeared to be the big driver of my instability.
>>>
>>> Reed
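For reference, the automatic balancer Reed describes can be toggled and inspected roughly as follows (a minimal sketch; "myplan" is only an illustrative plan name, not something from this thread):

    # check whether the automatic balancer is active
    ceph balancer status

    # disable automatic balancing
    ceph balancer off

    # a manual run can still be made on demand
    ceph balancer optimize myplan
    ceph balancer eval myplan
    ceph balancer execute myplan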
"ceph -s" shows: >>>>>>> >>>>>>> cluster: >>>>>>> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9 >>>>>>> health: HEALTH_WARN >>>>>>> no active mgr >>>>>>> 1/3 mons down, quorum mon001,mon002 >>>>>>> services: >>>>>>> mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003 >>>>>>> mgr: no daemons active (since 56m) >>>>>>> ... >>>>>>> (the third mon has a planned outage and will come back in a few days) >>>>>>> >>>>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time when it goes "silent", first for the first mgr: >>>>>>> >>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail >>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248 >>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail >>>>>>> >>>>>>> and a bit later, on the standby mgr: >>>>>>> >>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail >>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196 >>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail >>>>>>> >>>>>>> Interestingly, the dashboard still works, but presents outdated information, and for example zero I/O going on. >>>>>>> I believe this started to happen mainly after the third mon went into the known downtime, but I am not fully sure if this was the trigger, since the cluster is still growing. >>>>>>> It may also have been the addition of 24 more OSDs. 
>>>> On 02.11.19 at 08:23, Thomas wrote:
>>>>> Hi Oliver,
>>>>>
>>>>> I experienced a situation where the MGR "went crazy", meaning the MGR was active but not working.
>>>>> In the logs of the standby MGR nodes I found an error (after restarting the service) that pointed to the Ceph Dashboard.
>>>>>
>>>>> Since disabling the dashboard, my MGRs are stable again.
>>>>>
>>>>> Regards
>>>>> Thomas
>>>>>
>>>>> On 02.11.2019 at 02:48, Oliver Freyermuth wrote:
>>>>>> Dear Cephers,
>>>>>>
>>>>>> interestingly, after:
>>>>>>     ceph device monitoring off
>>>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later,
>>>>>> but the standby took over and was stable, and after restarting the broken one, it has now been stable for an hour, too.
>>>>>> So probably a restart of the mgr is needed after disabling device monitoring to get things stable again.
>>>>>>
>>>>>> So it seems to be caused by a problem with the device health metrics. In case this is a red herring and the mgrs become unstable again in the next days,
>>>>>> I'll let you know.
>>>>>>
>>>>>> Cheers,
>>>>>> Oliver
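Turning off the device health metrics and then bouncing the stuck active mgr, as done above, amounts to something like the following (the mgr name "mon001" is only an illustrative placeholder):

    # stop the devicehealth module from collecting metrics
    ceph device monitoring off

    # fail over to the standby so a fresh mgr daemon takes over
    ceph mgr fail mon001
    # or, on the node running the stuck daemon:
    # systemctl restart ceph-mgr@mon001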
>>>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
>>>>>>> Dear Cephers,
>>>>>>>
>>>>>>> this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
>>>>>>>
>>>>>>>   cluster:
>>>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>>>>     health: HEALTH_WARN
>>>>>>>             no active mgr
>>>>>>>             1/3 mons down, quorum mon001,mon002
>>>>>>>
>>>>>>>   services:
>>>>>>>     mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>>>>     mgr: no daemons active (since 56m)
>>>>>>>     ...
>>>>>>>
>>>>>>> (the third mon has a planned outage and will come back in a few days)
>>>>>>>
>>>>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time when it goes "silent", first for the first mgr:
>>>>>>>
>>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>
>>>>>>> and a bit later, on the standby mgr:
>>>>>>>
>>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>>
>>>>>>> Interestingly, the dashboard still works, but presents outdated information, for example zero I/O going on.
>>>>>>> I believe this started to happen mainly after the third mon went into the known downtime, but I am not fully sure if this was the trigger, since the cluster is still growing.
>>>>>>> It may also have been the addition of 24 more OSDs.
>>>>>>>
>>>>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
>>>>>>> ------------------------------
>>>>>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>>>>     ioctx.operate_read_op(op, devid)
>>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
>>>>>>> ------------------------------
>>>>>>> or:
>>>>>>> ------------------------------
>>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
>>>>>>> ------------------------------
>>>>>>> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>>>>> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
>>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
>>>>>>>
>>>>>>>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>>>>>  1: (()+0xf5f0) [0x7fa488cee5f0]
>>>>>>>  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>>>>>  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>>  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>>>>>  7: (()+0x709c8) [0x7fa48ae479c8]
>>>>>>>  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>>  9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>>>>>  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>>  11: (()+0x4bb95) [0x7fa48ae22b95]
>>>>>>>  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>>>>>  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>>>>>  14: (()+0x16c23d) [0x55c3c204023d]
>>>>>>>  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>>>>>  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>>>>>  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>>>>>  18: (()+0x7e65) [0x7fa488ce6e65]
>>>>>>>  19: (clone()+0x6d) [0x7fa48799488d]
>>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> I have issued:
>>>>>>>     ceph device monitoring off
>>>>>>> for now and will keep waiting to see if the mgrs go silent again. If there are any better ideas or this issue is known, let me know.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Oliver
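For completeness, the health metrics queried in the crash above are normally accessed along these lines (a sketch; "<devid>" is a placeholder for a device ID taken from "ceph device ls", which is not the same as the OSD name):

    # list tracked devices and their device IDs
    ceph device ls

    # devices backing a specific daemon
    ceph device ls-by-daemon osd.11

    # query the collected SMART data for one device
    ceph device get-health-metrics <devid>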
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx