Re: mgr daemons becoming unresponsive

Dear Sage,

On 07.11.19 14:33, Sage Weil wrote:
On Thu, 7 Nov 2019, Thomas Schneider wrote:
Hi,

I have installed package
ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
manually:
root@ld5505:/home# dpkg --force-depends -i
ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb
(Reading database ... 107461 files and directories currently installed.)
Preparing to unpack ceph-mgr_14.2.4-1-gd592e56-1bionic_amd64.deb ...
Unpacking ceph-mgr (14.2.4-1-gd592e56-1bionic) over
(14.2.4-1-gd592e56-1bionic) ...
dpkg: ceph-mgr: dependency problems, but configuring anyway as you
requested:
  ceph-mgr depends on ceph-base (= 14.2.4-1-gd592e56-1bionic); however:
   Package ceph-base is not configured yet.

Setting up ceph-mgr (14.2.4-1-gd592e56-1bionic) ...

Then I restarted ceph-mgr.

However, there's no effect, meaning the log entries are still the same.

The ceph-mgr package is sufficient.

Note that the only change on top of 14.2.4 is that the mgr devicehealth
module will scrape OSDs only, not mons.

You can probably/hopefully induce the (previously) bad behavior by
triggering a scrape manually with 'ceph device scrape-health-metrics'?

Indeed, this is solved now! I get zero response from the command (it hangs), but this might also be caused by the SELinux issues
Benjamin mentioned earlier ( https://tracker.ceph.com/issues/40683 ). But at least the mgr "silence" is gone :-).

Additionally, I have been running the new mgr packages for a few hours now and not a single failover has happened (even though our
third mon is still missing; it will take even longer than expected to come back...). So device health monitoring can now be left on :-).
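
For the record, re-enabling it is simply the counterpart of the command used earlier to disable it:

   ceph device monitoring on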

Cheers and thanks,
	Oliver


sage


Or should I install dependencies, namely
ceph-base_14.2.4-1-gd592e56-1bionic_amd64.deb, too?
Or any other packages?

Installation from the repo fails when using this repo file:
root@ld5506:~# more /etc/apt/sources.list.d/ceph-shaman.list
deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/
bionic main

W: Failed to fetch
https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/dists/bionic/InRelease
500  Internal Server Error [IP: 147.204.6.136 8080]
W: Some index files failed to download. They have been ignored, or old
ones used instead.

Regards
Thomas

On 07.11.2019 10:04, Oliver Freyermuth wrote:
Dear Thomas,

the most correct thing to do is probably to add the full repo
(the original link was still empty for me, but
https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ seems
to work).
The commit itself suggests the ceph-mgr package should be sufficient.
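
A possible sketch of doing just that on bionic, assuming the shaman repo is added exactly like the repo file quoted above (untested here, and the build may not be published yet):

   echo "deb https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/ bionic main" \
       > /etc/apt/sources.list.d/ceph-shaman.list
   apt update
   apt install --only-upgrade ceph-mgr
   systemctl restart ceph-mgr.target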

I'm still pondering, though, since our cluster is close to production
(and disk health monitoring is disabled for now) -
but updating the mgrs alone should also be fine for us. I hope to
have time for the experiment later today ;-).

Cheers,
     Oliver

On 07.11.19 08:57, Thomas Schneider wrote:
Hi,

can you please advise which package(s) should be installed?

Thanks



On 06.11.2019 22:28, Sage Weil wrote:
My current working theory is that the mgr is getting hung up when it
tries to scrape the device metrics from the mon.  The 'tell' mechanism
used to send mon-targeted commands is pretty kludgey/broken in nautilus
and earlier.  It's been rewritten for octopus, but isn't worth
backporting--it never really caused problems until the devicehealth
module started using it heavily.

In any case, this PR just disables scraping of mon devices for
nautilus:

          https://github.com/ceph/ceph/pull/31446

There is a build queued at

https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$


which should get packages in 1-2 hours.

Perhaps you can install that package on the mgr host and try to
reproduce the issue again?
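
A quick way to confirm afterwards that the test build is the one actually installed and running (plain Debian packaging and ceph commands, nothing specific to the test repo):

   dpkg -l ceph-mgr   # installed package version on the mgr host
   ceph versions      # versions reported by the running daemons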

I noticed a few other oddities in the logs while looking through them,
like

     https://tracker.ceph.com/issues/42666

which will hopefully have a fix ready for 14.2.5.  I'm not sure
about that
auth error message, though!

sage


On Sat, 2 Nov 2019, Oliver Freyermuth wrote:

Dear Sage,

good news - it happened again, with debug logs!
There's nothing obvious to my eye; it's uploaded as:
0b2d0c09-46f3-4126-aa27-e2d2e8572741
It seems the failure roughly coincided with my trying to access the
dashboard. It must have happened within the last ~5-10 minutes of the
log.

I'll now go back to "stable operation"; in case you need anything
else, just let me know.

Cheers and all the best,
     Oliver

On 02.11.19 17:38, Oliver Freyermuth wrote:
Dear Sage,

at least for the simple case:
   ceph device get-health-metrics osd.11
=> mgr crashes (but in that case, it crashes fully, i.e. the
process is gone)
I have now uploaded a verbose log as:
ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
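
(For anyone unfamiliar with it: ceph-post-file uploads the given files to the Ceph developers' drop box and prints a tag like the one above; the invocation was along the lines of the following, with the exact log path being an assumption.)

   ceph-post-file /var/log/ceph/ceph-mgr.<hostname>.log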

One potential cause of this (and maybe of the other issues) might be
that some of our OSDs are on non-JBOD controllers and hence are set up
as a single-disk RAID 0 per drive,
so a simple smartctl on the device will not work (-d megaraid,<number>
would be needed).
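
For illustration, the difference on such a controller looks roughly like this (the device name and megaraid disk number are made up for the example):

   # querying the single-disk RAID 0 virtual drive directly yields no SMART data:
   smartctl -a /dev/sdb
   # addressing the physical disk behind the MegaRAID controller instead:
   smartctl -a -d megaraid,5 /dev/sdb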

Now I have both mgrs active again, debug logging on, and device health
metrics on again,
and am waiting for them to go silent again. Let's hope the issue
reappears before the disks fill up with logs ;-).

Cheers,
     Oliver

On 02.11.19 02:56, Sage Weil wrote:
On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
Dear Cephers,

interestingly, after:
   ceph device monitoring off
the mgrs seem to be stable now - the active one still went silent a few
minutes later,
but the standby took over and stayed stable, and after restarting the
broken one, it has now been stable for an hour, too.
So a restart of the mgr is probably needed after disabling device
monitoring to get things stable again.
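
In concrete terms, the sequence on our side was roughly the following (the mgr instance name is a placeholder):

   ceph device monitoring off
   # bounce the stuck daemon so the change takes effect, e.g. via systemd:
   systemctl restart ceph-mgr@<hostname>
   # or fail over to the standby instead:
   ceph mgr fail <active-mgr-name>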

So it seems to be caused by a problem with the device health
metrics. In case this is a red herring and the mgrs become unstable
again in the next few days,
I'll let you know.
If this seems to stabilize things, and you can tolerate inducing the
failure again, reproducing the problem with mgr logs cranked up
(debug_mgr
= 20, debug_ms = 1) would probably give us a good idea of why the
mgr is
hanging.  Let us know!
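
For reference, one way to raise those levels on a running cluster and revert them afterwards (standard ceph config commands, nothing specific to this issue):

   ceph config set mgr debug_mgr 20
   ceph config set mgr debug_ms 1
   # ... reproduce the hang, collect /var/log/ceph/ceph-mgr.*.log, then revert:
   ceph config rm mgr debug_mgr
   ceph config rm mgr debug_ms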

Thanks,
sage

Cheers,
     Oliver

On 01.11.19 23:09, Oliver Freyermuth wrote:
Dear Cephers,

this is a 14.2.4 cluster with device health metrics enabled -
for about a day now, all mgr daemons have been going "silent" on me
after a few hours, i.e. "ceph -s" shows:

    cluster:
      id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
      health: HEALTH_WARN
              no active mgr
              1/3 mons down, quorum mon001,mon002

    services:
      mon:    3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
      mgr:    no daemons active (since 56m)
      ...
(the third mon has a planned outage and will come back in a few
days)

Checking the logs of the mgr daemons, I find some "reset"
messages at the time when it goes "silent", first for the first
mgr:

2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster)
log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB
data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0
ms_handle_reset on v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster)
log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB
data, 2.3 TiB used, 136 TiB / 138 TiB avail

and a bit later, on the standby mgr:

2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster)
log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim,
858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data,
2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700  0 client.0
ms_handle_reset on v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster)
log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim,
858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data,
2.3 TiB used, 136 TiB / 138 TiB avail

Interestingly, the dashboard still works, but presents outdated
information, e.g. it shows zero I/O going on.
I believe this mainly started to happen after the third mon
went into its planned downtime, but I am not fully sure that this
was the trigger, since the cluster is still growing.
It may also have been the addition of 24 more OSDs.


I also find other messages in the mgr logs which seem
problematic, but I am not sure they are related:
------------------------------
2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error
reading OMAP: [errno 22] Failed to operate read op for oid
Traceback (most recent call last):
    File "/usr/share/ceph/mgr/devicehealth/module.py", line 396,
in put_device_metrics
      ioctx.operate_read_op(op, devid)
    File "rados.pyx", line 516, in
rados.requires.wrapper.validate_func
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
    File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op
(/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid
------------------------------
or:
------------------------------
2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail
to parse JSON result from daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail
to parse JSON result from daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail
to parse JSON result from daemon osd.53 ()
------------------------------

The reason why I am cautious about the health metrics is that I
observed a crash when trying to query them:
------------------------------
2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log
[DBG] : from='client.174136 -' entity='client.admin'
cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11",
"target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth]
handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal
(Segmentation fault) **
   in thread 7fa46394b700 thread_name:mgr-fin

   ceph version 14.2.4
(75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
   1: (()+0xf5f0) [0x7fa488cee5f0]
   2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
   3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
   4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
   5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
   6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
   7: (()+0x709c8) [0x7fa48ae479c8]
   8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
   9: (()+0x5aaa5) [0x7fa48ae31aa5]
   10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
   11: (()+0x4bb95) [0x7fa48ae22b95]
   12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
   13: (ActivePyModule::handle_command(std::map<std::string,
boost::variant<std::string, bool, long, double,
std::vector<std::string, std::allocator<std::string> >,
std::vector<long, std::allocator<long> >, std::vector<double,
std::allocator<double> > >, std::less<void>,
std::allocator<std::pair<std::string const,
boost::variant<std::string, bool, long, double,
std::vector<std::string, std::allocator<std::string> >,
std::vector<long, std::allocator<long> >, std::vector<double,
std::allocator<double> > > > > > const&,
ceph::buffer::v14_2_0::list const&,
std::basic_stringstream<char, std::char_traits<char>,
std::allocator<char> >*, std::basic_stringstream<char,
std::char_traits<char>, std::allocator<char> >*)+0x20e)
[0x55c3c1fefc5e]
   14: (()+0x16c23d) [0x55c3c204023d]
   15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
   16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
   17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
   18: (()+0x7e65) [0x7fa488ce6e65]
   19: (clone()+0x6d) [0x7fa48799488d]
   NOTE: a copy of the executable, or `objdump -rdS
<executable>` is needed to interpret this.
------------------------------

I have issued:
ceph device monitoring off
for now and will keep waiting to see if mgrs go silent again.
If there are any better ideas or this issue is known, let me know.

Cheers,
     Oliver



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
