On Thu, 7 Nov 2019, Thomas Schneider wrote:
> Hi,
>
> I have installed the package ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb manually:
>
> root@ld5505:/home# dpkg --force-depends -i ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
> (Reading database ... 107461 files and directories currently installed.)
> Preparing to unpack ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb ...
> Unpacking ceph-mgr (14.2.4-1-gd592e56-1bionic) over (14.2.4-1-gd592e56-1bionic) ...
> dpkg: ceph-mgr: dependency problems, but configuring anyway as you requested:
>  ceph-mgr depends on ceph-base (= 14.2.4-1-gd592e56-1bionic); however:
>   Package ceph-base is not configured yet.
>
> Setting up ceph-mgr (14.2.4-1-gd592e56-1bionic) ...
>
> Then I restarted ceph-mgr.
>
> However, there is no effect, meaning the log entries are still the same.

The ceph-mgr package is sufficient.  Note that the only change on top of 14.2.4 is that the mgr devicehealth module will scrape OSDs only, not mons.  You can probably/hopefully induce the (previously) bad behavior by triggering a scrape manually with 'ceph device scrape-health-metrics'?

sage

> Or should I install dependencies, namely ceph-base_14.2.4-1-gd592e56-1bionic_amd64.deb, too? Or any other packages?
>
> Installation from the repo fails when using this repo file:
>
> root@ld5506:~# more /etc/apt/sources.list.d/ceph-shaman.list
> deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ bionic main
>
> W: Failed to fetch https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/dists/bionic/InRelease ; 500 Internal Server Error [IP: 147.204.6.136 8080]
> W: Some index files failed to download. They have been ignored, or old ones used instead.
>
> Regards
> Thomas
>
> On 07.11.2019 at 10:04, Oliver Freyermuth wrote:
> > Dear Thomas,
> >
> > the most correct thing to do is probably to add the full repo (the original link was still empty for me, but https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ seems to work). The commit itself suggests the ceph-mgr package should be sufficient.
> >
> > I'm still pondering, though, since our cluster is close to production (and for now disk health monitoring is disabled) - but updating the mgrs alone should also be fine with us. I hope to have time for the experiment later today ;-).
> >
> > Cheers,
> > Oliver
> >
> > On 07.11.19 at 08:57, Thomas Schneider wrote:
> >> Hi,
> >>
> >> can you please advise which package(s) should be installed?
> >>
> >> Thanks
> >>
> >> On 06.11.2019 at 22:28, Sage Weil wrote:
> >>> My current working theory is that the mgr is getting hung up when it tries to scrape the device metrics from the mon.  The 'tell' mechanism used to send mon-targeted commands is pretty kludgey/broken in nautilus and earlier.  It's been rewritten for octopus, but isn't worth backporting -- it never really caused problems until the devicehealth module started using it heavily.
> >>>
> >>> In any case, this PR just disables scraping of mon devices for nautilus:
> >>>
> >>> https://github.com/ceph/ceph/pull/31446
> >>>
> >>> There is a build queued at
> >>>
> >>> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$
> >>>
> >>> which should get packages in 1-2 hours.
> >>>
> >>> Perhaps you can install that package on the mgr host and try to reproduce it again?
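
For the manual reproduction suggested above, here is a rough sketch of how the scrape and a follow-up responsiveness check could be scripted. This is an illustration, not something from the thread: it assumes only that the `ceph` CLI and an admin keyring are available on the mgr host, and the JSON field names used here should be double-checked on your version.
------------------------------
#!/usr/bin/env python3
"""Sketch: trigger the manual device scrape and check that the mgr stays responsive."""
import json
import subprocess


def ceph(*args, timeout=30):
    """Run a ceph CLI command, failing if it errors or exceeds the timeout."""
    return subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True, timeout=timeout
    ).stdout


# Trigger the scrape that (before the wip-no-scrape-mons fix) could hang the mgr.
ceph("device", "scrape-health-metrics", timeout=300)

# If the mgr went silent, this either times out or reports no active mgr.
status = json.loads(ceph("status", "--format", "json"))
active = status.get("mgrmap", {}).get("active_name")  # field name: verify locally
print("active mgr:", active if active else "<none>")
------------------------------
If the scrape still wedges the mgr, the second call will raise subprocess.TimeoutExpired or print "<none>", which mirrors the "no active mgr" state described further down in the thread.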
> >>>
> >>> I noticed a few other oddities in the logs while looking through them, like
> >>>
> >>> https://tracker.ceph.com/issues/42666
> >>>
> >>> which will hopefully have a fix ready for 14.2.5.  I'm not sure about that auth error message, though!
> >>>
> >>> sage
> >>>
> >>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>
> >>>> Dear Sage,
> >>>>
> >>>> good news - it happened again, with debug logs!
> >>>> There's nothing obvious to my eye; it's uploaded as: 0b2d0c09-46f3-4126-aa27-e2d2e8572741
> >>>> It seems the failure was roughly in parallel to me wanting to access the dashboard. It must have happened within the last ~5-10 minutes of the log.
> >>>>
> >>>> I'll now go back to "stable operation"; in case you need anything else, just let me know.
> >>>>
> >>>> Cheers and all the best,
> >>>> Oliver
> >>>>
> >>>> On 02.11.19 at 17:38, Oliver Freyermuth wrote:
> >>>>> Dear Sage,
> >>>>>
> >>>>> at least for the simple case:
> >>>>> ceph device get-health-metrics osd.11
> >>>>> => mgr crashes (but in that case it crashes fully, i.e. the process is gone)
> >>>>> I have now uploaded a verbose log as:
> >>>>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
> >>>>>
> >>>>> One potential cause of this (and maybe of the other issues) might be that some of our OSDs are on non-JBOD controllers and hence are made by forming a RAID 0 per disk, so a simple smartctl on the device will not work (-d megaraid,<number> would be needed instead).
> >>>>>
> >>>>> Now I have both mgrs active again, debug logging on, and device health metrics on again, and am waiting for them to become silent again. Let's hope the issue reappears before the disks run full of logs ;-).
> >>>>>
> >>>>> Cheers,
> >>>>> Oliver
> >>>>>
> >>>>> On 02.11.19 at 02:56, Sage Weil wrote:
> >>>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
> >>>>>>> Dear Cephers,
> >>>>>>>
> >>>>>>> interestingly, after:
> >>>>>>> ceph device monitoring off
> >>>>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later, but the standby took over and was stable, and after restarting the broken one it has now been stable for an hour, too. So a restart of the mgr is probably needed after disabling device monitoring to get things stable again.
> >>>>>>>
> >>>>>>> So it seems to be caused by a problem with the device health metrics. In case this is a red herring and the mgrs become unstable again in the next days, I'll let you know.
> >>>>>> If this seems to stabilize things, and you can tolerate inducing the failure again, reproducing the problem with mgr logs cranked up (debug_mgr = 20, debug_ms = 1) would probably give us a good idea of why the mgr is hanging.  Let us know!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> sage
> >>>>>>
> >>>>>>> Cheers,
> >>>>>>> Oliver
> >>>>>>>
> >>>>>>> On 01.11.19 at 23:09, Oliver Freyermuth wrote:
> >>>>>>>> Dear Cephers,
> >>>>>>>>
> >>>>>>>> this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
"ceph -s" shows: > >>>>>>>> > >>>>>>>> cluster: > >>>>>>>> id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9 > >>>>>>>> health: HEALTH_WARN > >>>>>>>> no active mgr > >>>>>>>> 1/3 mons down, quorum mon001,mon002 > >>>>>>>> services: > >>>>>>>> mon: 3 daemons, quorum mon001,mon002 (age 57m), out > >>>>>>>> of quorum: mon003 > >>>>>>>> mgr: no daemons active (since 56m) > >>>>>>>> ... > >>>>>>>> (the third mon has a planned outage and will come back in a few > >>>>>>>> days) > >>>>>>>> > >>>>>>>> Checking the logs of the mgr daemons, I find some "reset" > >>>>>>>> messages at the time when it goes "silent", first for the first > >>>>>>>> mgr: > >>>>>>>> > >>>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB > >>>>>>>> data, 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 > >>>>>>>> ms_handle_reset on v2:10.160.16.1:6800/401248 > >>>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB > >>>>>>>> data, 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> > >>>>>>>> and a bit later, on the standby mgr: > >>>>>>>> > >>>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, > >>>>>>>> 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, > >>>>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 > >>>>>>>> ms_handle_reset on v2:10.160.16.2:6800/352196 > >>>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) > >>>>>>>> log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, > >>>>>>>> 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, > >>>>>>>> 2.3 TiB used, 136 TiB / 138 TiB avail > >>>>>>>> > >>>>>>>> Interestingly, the dashboard still works, but presents outdated > >>>>>>>> information, and for example zero I/O going on. > >>>>>>>> I believe this started to happen mainly after the third mon > >>>>>>>> went into the known downtime, but I am not fully sure if this > >>>>>>>> was the trigger, since the cluster is still growing. > >>>>>>>> It may also have been the addition of 24 more OSDs. 
> >>>>>>>>
> >>>>>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
> >>>>>>>> Traceback (most recent call last):
> >>>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
> >>>>>>>>     ioctx.operate_read_op(op, devid)
> >>>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
> >>>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
> >>>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid
> >>>>>>>> ------------------------------
> >>>>>>>> or:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
> >>>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
> >>>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
> >>>>>>>> ------------------------------
> >>>>>>>>
> >>>>>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
> >>>>>>>> ------------------------------
> >>>>>>>> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
> >>>>>>>> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
> >>>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
> >>>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
> >>>>>>>>
> >>>>>>>>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
> >>>>>>>>  1: (()+0xf5f0) [0x7fa488cee5f0]
> >>>>>>>>  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
> >>>>>>>>  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
> >>>>>>>>  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
> >>>>>>>>  7: (()+0x709c8) [0x7fa48ae479c8]
> >>>>>>>>  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>>>>  9: (()+0x5aaa5) [0x7fa48ae31aa5]
> >>>>>>>>  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
> >>>>>>>>  11: (()+0x4bb95) [0x7fa48ae22b95]
> >>>>>>>>  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
> >>>>>>>>  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
> >>>>>>>>  14: (()+0x16c23d) [0x55c3c204023d]
> >>>>>>>>  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
> >>>>>>>>  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
> >>>>>>>>  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
> >>>>>>>>  18: (()+0x7e65) [0x7fa488ce6e65]
> >>>>>>>>  19: (clone()+0x6d) [0x7fa48799488d]
> >>>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >>>>>>>> ------------------------------
> >>>>>>>>
> >>>>>>>> I have issued:
> >>>>>>>> ceph device monitoring off
> >>>>>>>> for now and will keep waiting to see if the mgrs go silent again.
> >>>>>>>> If there are any better ideas or this issue is known, let me know.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Oliver
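
Oliver's note earlier in the thread about non-JBOD controllers (one RAID 0 per disk, so plain smartctl fails and -d megaraid,<number> is needed) is also one plausible reason for the empty "Fail to parse JSON result" entries above. Below is a hedged sketch of probing SMART health through such a controller; the device path and the 0-31 ID range are assumptions to adapt to the local controller layout.
------------------------------
#!/usr/bin/env python3
"""Sketch: probe SMART health for disks hidden behind a MegaRAID controller."""
import subprocess

DEVICE = "/dev/sda"  # hypothetical: the block device exposed by the RAID controller

for disk_id in range(32):  # assumed range of megaraid disk IDs to probe
    result = subprocess.run(
        ["smartctl", "-d", f"megaraid,{disk_id}", "-H", DEVICE],
        capture_output=True, text=True,
    )
    # smartctl prints one of these lines when it can reach the disk behind the controller
    for line in result.stdout.splitlines():
        if "SMART overall-health" in line or "SMART Health Status" in line:
            print(f"megaraid,{disk_id}: {line.strip()}")
------------------------------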
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx