On Thu, 7 Nov 2019, Thomas Schneider wrote:
> Hi,
>
> I have installed the package ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb manually:
>
> root@ld5505:/home# dpkg --force-depends -i ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
> (Reading database ... 107461 files and directories currently installed.)
> Preparing to unpack ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb ...
> Unpacking ceph-mgr (14.2.4-1-gd592e56-1bionic) over (14.2.4-1-gd592e56-1bionic) ...
> dpkg: ceph-mgr: dependency problems, but configuring anyway as you requested:
>  ceph-mgr depends on ceph-base (= 14.2.4-1-gd592e56-1bionic); however:
>   Package ceph-base is not configured yet.
>
> Setting up ceph-mgr (14.2.4-1-gd592e56-1bionic) ...
>
> Then I restarted ceph-mgr.
>
> However, there is no effect, meaning the log entries are still the same.

The ceph-mgr package is sufficient.  Note that the only change on top of 14.2.4 is that the mgr devicehealth module will scrape OSDs only, not mons.  You can probably/hopefully induce the (previously) bad behavior by triggering a scrape manually with 'ceph device scrape-health-metrics'?

sage

> Or should I install dependencies, namely ceph-base_14.2.4-1-gd592e56-1bionic_amd64.deb, too? Or any other packages?
>
> Installation from the repo fails when using this repo file:
>
> root@ld5506:~# more /etc/apt/sources.list.d/ceph-shaman.list
> deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ bionic main
>
> W: Failed to fetch https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/dists/bionic/InRelease ; 500 Internal Server Error [IP: 147.204.6.136 8080]
> W: Some index files failed to download. They have been ignored, or old ones used instead.
>
> Regards
> Thomas
>
> On 07.11.2019 at 10:04, Oliver Freyermuth wrote:
> > Dear Thomas,
> >
> > the most correct thing to do is probably to add the full repo (the original link was still empty for me, but https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ seems to work). The commit itself suggests the ceph-mgr package should be sufficient.
> >
> > I'm still pondering, though, since our cluster is close to production (and for now disk health monitoring is disabled) - but updating the mgrs alone should also be fine with us. I hope to have time for the experiment later today ;-).
> >
> > Cheers,
> > Oliver
> >
> > On 07.11.19 at 08:57, Thomas Schneider wrote:
> >> Hi,
> >>
> >> can you please advise which package(s) should be installed?
> >>
> >> Thanks
> >>
> >> On 06.11.2019 at 22:28, Sage Weil wrote:
> >>> My current working theory is that the mgr is getting hung up when it tries to scrape the device metrics from the mon.  The 'tell' mechanism used to send mon-targeted commands is pretty kludgey/broken in nautilus and earlier.  It's been rewritten for octopus, but isn't worth backporting -- it never really caused problems until the devicehealth module started using it heavily.
> >>>
> >>> In any case, this PR just disables scraping of mon devices for nautilus:
> >>>
> >>> https://github.com/ceph/ceph/pull/31446
> >>>
> >>> There is a build queued at
> >>>
> >>> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$
> >>>
> >>> which should get packages in 1-2 hours.
> >>>
> >>> Perhaps you can install that package on the mgr host and try to reproduce it again?
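
For the manual reproduction suggested above, here is a rough sketch of how the scrape and a follow-up responsiveness check could be scripted. This is an illustration, not something from the thread: it assumes only that the `ceph` CLI and an admin keyring are available on the mgr host, and the JSON field names used here should be double-checked on your version.
------------------------------
#!/usr/bin/env python3
"""Sketch: trigger the manual device scrape and check that the mgr stays responsive."""
import json
import subprocess


def ceph(*args, timeout=30):
    """Run a ceph CLI command, failing if it errors or exceeds the timeout."""
    return subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True, timeout=timeout
    ).stdout


# Trigger the scrape that (before the wip-no-scrape-mons fix) could hang the mgr.
ceph("device", "scrape-health-metrics", timeout=300)

# If the mgr went silent, this either times out or reports no active mgr.
status = json.loads(ceph("status", "--format", "json"))
active = status.get("mgrmap", {}).get("active_name")  # field name: verify locally
print("active mgr:", active if active else "<none>")
------------------------------
If the scrape still wedges the mgr, the second call will raise subprocess.TimeoutExpired or print "<none>", which mirrors the "no active mgr" state described further down in the thread.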
> >>>
> >>> I noticed a few other oddities in the logs while looking through them, like
> >>>
> >>> https://tracker.ceph.com/issues/42666
> >>>
> >>> which will hopefully have a fix ready for 14.2.5.  I'm not sure about that auth error message, though!
> >>>
> >>> sage
> >>>
> >>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>
> >>>> Dear Sage,
> >>>>
> >>>> good news - it happened again, with debug logs!
> >>>> There's nothing obvious to my eye; it's uploaded as: 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> >>>> It seems the failure was roughly in parallel to me wanting to access the dashboard. It must have happened within the last ~5-10 minutes of the log.
> >>>>
> >>>> I'll now go back to "stable operation"; in case you need anything else, just let me know.
> >>>>
> >>>> Cheers and all the best,
> >>>> Oliver
> >>>>
> >>>> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
> >>>>> Dear Sage,
> >>>>>
> >>>>> at least for the simple case:
> >>>>> ceph device get-health-metrics osd.11
> >>>>> => mgr crashes (but in that case it crashes fully, i.e. the process is gone)
> >>>>> I have now uploaded a verbose log as:
> >>>>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> >>>>>
> >>>>> One potential cause of this (and maybe of the other issues) might be that some of our OSDs are on non-JBOD controllers and hence are made by forming a RAID 0 per disk, so a simple smartctl on the device will not work (-d megaraid,<number> would be needed instead).
> >>>>>
> >>>>> Now I have both mgrs active again, debug logging on, and device health metrics on again, and am waiting for them to become silent again. Let's hope the issue reappears before the disks run full of logs ;-).
> >>>>>
> >>>>> Cheers,
> >>>>> Oliver
> >>>>>
> >>>>> On 02.11.19 at 02:56, Sage Weil wrote:
> >>>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>>>>> Dear Cephers,
> >>>>>>>
> >>>>>>> interestingly, after:
> >>>>>>> ceph device monitoring off
> >>>>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later, but the standby took over and was stable, and after restarting the broken one it has now been stable for an hour, too. So a restart of the mgr is probably needed after disabling device monitoring to get things stable again.
> >>>>>>>
> >>>>>>> So it seems to be caused by a problem with the device health metrics. In case this is a red herring and the mgrs become unstable again in the next days, I'll let you know.
> >>>>>> If this seems to stabilize things, and you can tolerate inducing the failure again, reproducing the problem with mgr logs cranked up (debug_mgr = 20, debug_ms = 1) would probably give us a good idea of why the mgr is hanging.  Let us know!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> sage
> >>>>>>
> >>>>>>> Cheers,
> >>>>>>> Oliver
> >>>>>>>
> >>>>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> >>>>>>>> Dear Cephers,
> >>>>>>>>
> >>>>>>>> this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
"ceph -s" shows: > >>>>>>>> > >>>>>>>> cluster: > >>>>>>>> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9 > >>>>>>>> health: HEALTH_WARN > >>>>>>>> no active mgr > >>>>>>>> 1/3 mons down, quorum mon001,mon002 > >>>>>>>> services: > >>>>>>>> mon: 3 daemons, quorum mon001,mon002 (age 57m), out > >>>>>>>> of quorum: mon003 > >>>>>>>> mgr: no daemons active (since 56m) > >>>>>>>> ... > >>>>>>>> (the third mon has a planned outage and will come back in a few > >>>>>>>> days) > >>>>>>>> > >>>>>>>> Checking the logs of the mgr daemons, I find some "reset" > >>>>>>>> messages at the time when it goes "silent", first for the first > >>>>>>>> mgr: > >>>>>>>> > >>>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB > >>>>>>>> data, 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 > >>>>>>>> ms_handle_reset on v2:10.160.16.1:6800/401248 > >>>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB > >>>>>>>> data, 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> > >>>>>>>> and a bit later, on the standby mgr: > >>>>>>>> > >>>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, > >>>>>>>> 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, > >>>>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 > >>>>>>>> ms_handle_reset on v2:10.160.16.2:6800/352196 > >>>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, > >>>>>>>> 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, > >>>>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> > >>>>>>>> Interestingly, the dashboard still works, but presents outdated > >>>>>>>> information, and for example zero I/O going on. > >>>>>>>> I believe this started to happen mainly after the third mon > >>>>>>>> went into the known downtime, but I am not fully sure if this > >>>>>>>> was the trigger, since the cluster is still growing. > >>>>>>>> It may also have been the addition of 24 more OSDs. 
> >>>>>>>>
> >>>>>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
> >>>>>>>> Traceback (most recent call last):
> >>>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
> >>>>>>>>     ioctx.operate_read_op(op, devid)
> >>>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >>>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> >>>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
> >>>>>>>> ------------------------------
> >>>>>>>> or:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
> >>>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
> >>>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
> >>>>>>>> ------------------------------
> >>>>>>>>
> >>>>>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> >>>>>>>> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
> >>>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
> >>>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
> >>>>>>>>
> >>>>>>>>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> >>>>>>>>  1: (()+0xf5f0) [0x7fa488cee5f0]
> >>>>>>>>  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >>>>>>>>  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >>>>>>>>  7: (()+0x709c8) [0x7fa48ae479c8]
> >>>>>>>>  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>>>>  9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >>>>>>>>  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>>>>  11: (()+0x4bb95) [0x7fa48ae22b95]
> >>>>>>>>  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >>>>>>>>  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >>>>>>>>  14: (()+0x16c23d) [0x55c3c204023d]
> >>>>>>>>  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >>>>>>>>  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >>>>>>>>  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> >>>>>>>>  18: (()+0x7e65) [0x7fa488ce6e65]
> >>>>>>>>  19: (clone()+0x6d) [0x7fa48799488d]
> >>>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>>>>>>> ------------------------------
> >>>>>>>>
> >>>>>>>> I have issued:
> >>>>>>>> ceph device monitoring off
> >>>>>>>> for now and will keep waiting to see if the mgrs go silent again.
> >>>>>>>> If there are any better ideas or this issue is known, let me know.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Oliver
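
Oliver's note earlier in the thread about non-JBOD controllers (one RAID 0 per disk, so plain smartctl fails and -d megaraid,<number> is needed) is also one plausible reason for the empty "Fail to parse JSON result" entries above. Below is a hedged sketch of probing SMART health through such a controller; the device path and the 0-31 ID range are assumptions to adapt to the local controller layout.
------------------------------
#!/usr/bin/env python3
"""Sketch: probe SMART health for disks hidden behind a MegaRAID controller."""
import subprocess

DEVICE = "/dev/sda"  # hypothetical: the block device exposed by the RAID controller

for disk_id in range(32):  # assumed range of megaraid disk IDs to probe
    result = subprocess.run(
        ["smartctl", "-d", f"megaraid,{disk_id}", "-H", DEVICE],
        capture_output=True, text=True,
    )
    # smartctl prints one of these lines when it can reach the disk behind the controller
    for line in result.stdout.splitlines():
        if "SMART overall-health" in line or "SMART Health Status" in line:
            print(f"megaraid,{disk_id}: {line.strip()}")
------------------------------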
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx