Re: mgr daemons becoming unresponsive

Thomas Schneider <74cmonty@xxxxxxxxx> · Thu, 7 Nov 2019 08:57:48 +0100

Hi,

can you please advise which package(s) should be installed?

Thanks

Am 06.11.2019 um 22:28 schrieb Sage Weil:
> My current working theory is that the mgr is getting hung up when it tries 
> to scrape the device metrics from the mon.  The 'tell' mechanism used to  
> send mon-targetted commands is pretty kludgey/broken in nautilus and 
> earlier.  It's been rewritten for octopus, but isn't worth backporting--it 
> never really caused problems until the devicemanager started using it     
> heavily.
>
> In any case, this PR just disables scraping of mon devices for nautilus:
>
>         https://github.com/ceph/ceph/pull/31446
>
> There is a build queued at
>
>         
> https://shaman.ceph.com/repos/ceph/wip-no-scrape-mons-nautilus/d592e56e$
>
> which should get packages in 1-2 hours.
>
> Perhaps you can install that package on the mgr host and try again to 
> reproduce it again?
>
> I noticed a few other oddities in the logs while looking through them, 
> like
>
> 	https://tracker.ceph.com/issues/42666
>
> which will hopefully have a fix ready for 14.2.5.  I'm not sure about that 
> auth error message, though!
>
> sage
>
>
> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>
>> Dear Sage,
>>
>> good news - it happened again, with debug logs! 
>> There's nothing obvious to my eye, it's uploaded as:
>> 0b2d0c09-46f3-4126-aa27-e2d2e8572741
>> It seems the failure was roughly in parallel to me wanting to access the dashboard. It must have happened within the last ~5-10 minutes of the log. 
>>
>> I'll now go back to "stable operation", in case you need anything else, just let me know. 
>>
>> Cheers and all the best,
>> 	Oliver
>>
>> Am 02.11.19 um 17:38 schrieb Oliver Freyermuth:
>>> Dear Sage,
>>>
>>> at least for the simple case:
>>>  ceph device get-health-metrics osd.11
>>> => mgr crashes (but in that case, it crashes fully, i.e. the process is gone)
>>> I have now uploaded a verbose log as:
>>> ceph-post-file: e3bd60ad-cbce-4308-8b07-7ebe7998572e
>>>
>>> One potential cause of this (and maybe the other issues) might be because some of our OSDs are on non-JBOD controllers and hence are made by forming a Raid 0 per disk,
>>> so a simple smartctl on the device will not work (but -dmegaraid,<number> would be needed). 
>>>
>>> Now I have both mgrs active again, debug logging on, device health metrics on again,
>>> and am waiting for them to become silent again. Let's hope the issue reappears before the disks run full of logs ;-). 
>>>
>>> Cheers,
>>> 	Oliver
>>>
>>> Am 02.11.19 um 02:56 schrieb Sage Weil:
>>>> On Sat, 2 Nov 2019, Oliver Freyermuth wrote:
>>>>> Dear Cephers,
>>>>>
>>>>> interestingly, after:
>>>>>  ceph device monitoring off
>>>>> the mgrs seem to be stable now - the active one still went silent a few minutes later,
>>>>> but the standby took over and was stable, and restarting the broken one, it's now stable since an hour, too,
>>>>> so probably, a restart of the mgr is needed after disabling device monitoring to get things stable again. 
>>>>>
>>>>> So it seems to be caused by a problem with the device health metrics. In case this is a red herring and mgrs become instable again in the next days,
>>>>> I'll let you know. 
>>>> If this seems to stabilize things, and you can tolerate inducing the 
>>>> failure again, reproducing the problem with mgr logs cranked up (debug_mgr 
>>>> = 20, debug_ms = 1) would probably give us a good idea of why the mgr is 
>>>> hanging.  Let us know!
>>>>
>>>> Thanks,
>>>> sage
>>>>
>>>>  > 
>>>>> Cheers,
>>>>> 	Oliver
>>>>>
>>>>> Am 01.11.19 um 23:09 schrieb Oliver Freyermuth:
>>>>>> Dear Cephers,
>>>>>>
>>>>>> this is a 14.2.4 cluster with device health metrics enabled - since about a day, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
>>>>>>
>>>>>>   cluster:
>>>>>>     id:     269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
>>>>>>     health: HEALTH_WARN
>>>>>>             no active mgr
>>>>>>             1/3 mons down, quorum mon001,mon002
>>>>>>  
>>>>>>   services:
>>>>>>     mon:        3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
>>>>>>     mgr:        no daemons active (since 56m)
>>>>>>     ...
>>>>>> (the third mon has a planned outage and will come back in a few days)
>>>>>>
>>>>>> Checking the logs of the mgr daemons, I find some "reset" messages at the time when it goes "silent", first for the first mgr:
>>>>>>
>>>>>> 2019-11-01 21:34:40.286 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>> 2019-11-01 21:34:41.458 7f2e0d59b700  0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
>>>>>> 2019-11-01 21:34:42.287 7f2df6a6b700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>
>>>>>> and a bit later, on the standby mgr:
>>>>>>
>>>>>> 2019-11-01 22:18:14.892 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>> 2019-11-01 22:18:16.022 7f7be9e72700  0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
>>>>>> 2019-11-01 22:18:16.893 7f7bcc8ae700  0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
>>>>>>
>>>>>> Interestingly, the dashboard still works, but presents outdated information, and for example zero I/O going on. 
>>>>>> I believe this started to happen mainly after the third mon went into the known downtime, but I am not fully sure if this was the trigger, since the cluster is still growing. 
>>>>>> It may also have been the addition of 24 more OSDs. 
>>>>>>
>>>>>>
>>>>>> I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
>>>>>> ------------------------------
>>>>>> 2019-11-01 21:17:09.849 7f2df4266700  0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid 
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
>>>>>>     ioctx.operate_read_op(op, devid)
>>>>>>   File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
>>>>>> D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
>>>>>>   File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
>>>>>> InvalidArgumentError: [errno 22] Failed to operate read op for oid 
>>>>>> ------------------------------
>>>>>> or:
>>>>>> ------------------------------
>>>>>> 2019-11-01 21:33:53.977 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
>>>>>> 2019-11-01 21:33:53.978 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
>>>>>> 2019-11-01 21:33:53.979 7f7bd38bc700  0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
>>>>>> ------------------------------
>>>>>>
>>>>>> The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
>>>>>> ------------------------------
>>>>>> 2019-11-01 20:21:23.661 7fa46314a700  0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
>>>>>> 2019-11-01 20:21:23.661 7fa46394b700  0 mgr[devicehealth] handle_command
>>>>>> 2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
>>>>>>  in thread 7fa46394b700 thread_name:mgr-fin
>>>>>>
>>>>>>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>>>>>>  1: (()+0xf5f0) [0x7fa488cee5f0]
>>>>>>  2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
>>>>>>  3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
>>>>>>  6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
>>>>>>  7: (()+0x709c8) [0x7fa48ae479c8]
>>>>>>  8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>  9: (()+0x5aaa5) [0x7fa48ae31aa5]
>>>>>>  10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
>>>>>>  11: (()+0x4bb95) [0x7fa48ae22b95]
>>>>>>  12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
>>>>>>  13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
>>>>>>  14: (()+0x16c23d) [0x55c3c204023d]
>>>>>>  15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
>>>>>>  16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
>>>>>>  17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
>>>>>>  18: (()+0x7e65) [0x7fa488ce6e65]
>>>>>>  19: (clone()+0x6d) [0x7fa48799488d]
>>>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>> ------------------------------
>>>>>>
>>>>>> I have issued:
>>>>>> ceph device monitoring off
>>>>>> for now and will keep waiting to see if mgrs go silent again. If there are any better ideas or this issue is known, let me know. 
>>>>>>
>>>>>> Cheers,
>>>>>> 	Oliver
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx