Re: iostat and dashboard freezing

John Hearns <hearnsj@xxxxxxxxxxxxxx> · Tue, 27 Aug 2019 14:44:45 +0100

Try running gstack on the ceph-mgr process while it is frozen.
This could be a name resolution problem, as you suspect. gstack may show where the process is stuck, and that might turn out to be a call to your name resolution service.
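
In case it helps, the gstack step can be scripted. This is only a sketch under my own assumptions: it looks up processes named ceph-mgr with pgrep, dumps their stacks with gstack, and keeps the frames that mention common resolver symbols. The function names and the RESOLVER_HINTS list are hypothetical and chosen for illustration; pgrep and gstack need to be installed, and you will probably need to run it as root.

```python
import subprocess

# Substrings that tend to show up in frames doing name resolution.
# Illustrative assumption only; extend to match your resolver setup.
RESOLVER_HINTS = ("getaddrinfo", "gethostbyname", "getnameinfo")


def run_tool(argv):
    """Run an external tool and return its stdout, or '' if it is not installed."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return result.stdout


def find_pids(name="ceph-mgr"):
    """Return PIDs of processes whose name matches exactly (pgrep -x)."""
    return [int(tok) for tok in run_tool(["pgrep", "-x", name]).split()]


def resolver_frames(stack_text):
    """Return the stack lines that mention a resolver symbol."""
    return [
        line.strip()
        for line in stack_text.splitlines()
        if any(hint in line for hint in RESOLVER_HINTS)
    ]


if __name__ == "__main__":
    for pid in find_pids():
        frames = resolver_frames(run_tool(["gstack", str(pid)]))
        print(pid, frames if frames else "no resolver frames seen")
```

Run it a few times while iostat is stalled. Frames that persist across samples are the ones worth looking at.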

On Tue, 27 Aug 2019 at 14:25, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
Whoops, I'm running Scientific Linux 7.6, going to upgrade to 7.7 soon...

thanks

Jake

On 8/27/19 2:22 PM, Jake Grimmett wrote:

> Hi Reed,
>
> That exactly matches what I'm seeing:
>
> when iostat is working OK, I see ~5% CPU use by ceph-mgr
> and when iostat freezes, ceph-mgr CPU increases to 100%
>
> regarding OS, I'm using Scientific Linux 7.7
> Kernel 3.10.0-957.21.3.el7.x86_64
>
> I'm not sure if the mgr initiates scrubbing, but if so, this could be
> the cause of the "HEALTH_WARN 20 pgs not deep-scrubbed in time" that we see.
>
> Anyhow, many thanks for your input, please let me know if you have
> further ideas :)
>
> best,
>
> Jake
>
> On 8/27/19 2:01 PM, Reed Dier wrote:
>> Curious what dist you're running on, as I've been having similar issues with instability in the mgr as well, curious if any similar threads to pull at.
>>
>> While the iostat command is running, is the active mgr using 100% CPU in top?
>>
>> Reed
>>

>>> On Aug 27, 2019, at 6:41 AM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Dear All,
>>>
>>> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
>>>
>>> Unfortunately "ceph iostat" spends most of its time frozen, with
>>> occasional periods of working normally for less than a minute; it then
>>> freezes again for a couple of minutes, comes back to life, and so
>>> on...
>>>
>>> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
>>>
>>> [root@ceph-s3 ~]# ceph iostat
>>> ^CInterrupted
>>> Traceback (most recent call last):
>>>  File "/usr/bin/ceph", line 1263, in <module>
>>>    retval = main()
>>>  File "/usr/bin/ceph", line 1194, in main
>>>    verbose)
>>>  File "/usr/bin/ceph", line 619, in new_style_command
>>>    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
>>>  File "/usr/bin/ceph", line 593, in do_command
>>>    return ret, '', ''
>>> UnboundLocalError: local variable 'ret' referenced before assignment
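
The UnboundLocalError above looks like a side effect of the Ctrl+C rather than the cause of the stall. I have not checked the 14.2.2 source of /usr/bin/ceph, so what follows is only a guess at the pattern, with hypothetical function names: a variable is assigned inside a try block, the interrupt arrives before the assignment completes, and the return statement then references a name that was never bound.

```python
import errno


def poll_mgr():
    """Hypothetical stand-in for the blocking call; simulates Ctrl+C."""
    raise KeyboardInterrupt


def do_command_buggy():
    try:
        ret, outbuf, outs = poll_mgr()
    except KeyboardInterrupt:
        print("Interrupted")
    # 'ret' was never bound if the interrupt fired above.
    return ret, "", ""


def do_command_safe():
    ret = -errno.EINTR  # default value, so the name is always bound
    try:
        ret, outbuf, outs = poll_mgr()
    except KeyboardInterrupt:
        print("Interrupted")
    return ret, "", ""
```

Either way, the traceback only tells us the client was waiting on the mgr when it was interrupted, which fits the picture of a stalled mgr.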

>>>
>>> Observations:
>>>
>>> 1) This problem does not seem to be related to load on the cluster.
>>>
>>> 2) When iostat is stalled, the dashboard is also non-responsive; if
>>> iostat is working, the dashboard also works.
>>>
>>> Presumably the iostat and dashboard problems are due to the same
>>> underlying fault? Perhaps a problem with the mgr?
>>>
>>>
>>> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
>>> shows:
>>>
>>> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
>>> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
>>> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
>>> false}]: dispatch
>>>
>>> 4) When iostat isn't working, we see no obvious errors in the mgr log.
>>>
>>> 5) When the dashboard is not working, mgr log sometimes shows:
>>>

>>> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
>>> [::ffff:10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
>>> /api/health/minimal
>>> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
>>> Internal Server Error", "version": "3.2.2", "detail": "The server
>>> encountered an unexpected condition which prevented it from fulfilling
>>> the request.", "traceback": "Traceback (most recent call last):\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
>>> in respond\\n    response.body = self.handler()\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
>>> 188, in __call__\\n    self.body = self.oldhandler(*args, **kwargs)\\n
>>> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
>>> 221, in wrap\\n    return self.newhandler(innerfunc, *args, **kwargs)\\n
>>> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
>>> 88, in dashboard_exception_handler\\n    return handler(*args,
>>> **kwargs)\\n  File
>>> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
>>> in __call__\\n    return self.callable(*self.args, **self.kwargs)\\n
>>> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
>>> 649, in inner\\n    ret = func(*args, **kwargs)\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
>>> minimal\\n    return self.health_minimal.all_health()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
>>> all_health\\n    result[\'pools\'] = self.pools()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
>>> pools\\n    pools = CephService.get_pool_list_with_stats()\\n  File
>>> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
>>> in get_pool_list_with_stats\\n    \'series\': [i for i in
>>> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
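
The "deque mutated during iteration" error at the end of that traceback is what Python raises when a collections.deque changes while something is looping over it. My assumption, not confirmed from the dashboard source, is that another thread appends new samples to stat_series while the request handler iterates it. The sketch below reproduces the error deterministically by appending inside the loop, and shows one possible workaround (copy first, retry on failure). The names are hypothetical and this is not the upstream fix.

```python
from collections import deque


def iterate_while_appending(series):
    """Mimic a writer adding a sample mid-iteration (done inline so it is deterministic)."""
    seen = []
    for sample in series:
        seen.append(sample)
        series.append(("new", 0))  # mutation during iteration
    return seen


def snapshot(series, retries=3):
    """Copy the deque before use; retry if it changed while being copied."""
    for _ in range(retries):
        try:
            return list(series)
        except RuntimeError:
            continue
    return []


if __name__ == "__main__":
    stats = deque([(1, 10), (2, 12), (3, 11)])
    try:
        iterate_while_appending(stats)
    except RuntimeError as exc:
        print("reproduced:", exc)
    print("snapshot:", snapshot(deque([(1, 10), (2, 12)])))
```

If this is what happens in the mgr, the 500 responses would be a symptom of the mgr being busy rather than the root cause, but it may still be worth a tracker ticket.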

>>>
>>>
>>> 6) IPv6 is normally disabled on our machines at the kernel level, via
>>> grubby --update-kernel=ALL --args="ipv6.disable=1"
>>>
>>> However, disabling IPv6 interfered with the dashboard (giving
>>> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
>>> created',)"), so we re-enabled IPv6 on the mgr nodes only to fix this.
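
On the "No socket could be created" error: with ipv6.disable=1 the kernel refuses to create AF_INET6 sockets at all, so anything that tries to listen on "::" fails. A quick way to check a node is sketched below. It is a minimal check with hypothetical function names, and the fallback to 0.0.0.0 is just an illustration of the idea, not what the dashboard does.

```python
import socket


def can_create_ipv6_socket():
    """Return True if this host lets us create an IPv6 TCP socket."""
    if not socket.has_ipv6:
        return False
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        # Typical when the kernel was booted with ipv6.disable=1.
        return False
    sock.close()
    return True


def pick_bind_address():
    """Choose a wildcard address the host can actually listen on."""
    return "::" if can_create_ipv6_socket() else "0.0.0.0"


if __name__ == "__main__":
    print("IPv6 sockets available:", can_create_ipv6_socket())
    print("wildcard bind address:", pick_bind_address())
```

If I remember correctly, the dashboard's listen address is configurable (mgr/dashboard/server_addr), so binding it to an IPv4 address may be an alternative to re-enabling IPv6 on the mgr nodes. Please check the documentation for your release before relying on that.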

>>>
>>>
>>> Ideas...?
>>>
>>> Should ipv6 be enabled, even if not configured, on all ceph nodes?
>>>
>>> Any ideas on fixing this gratefully received!
>>>
>>> many thanks
>>>
>>> Jake
>>>
>>> -- 
>>> MRC Laboratory of Molecular Biology
>>> Francis Crick Avenue,
>>> Cambridge CB2 0QH, UK.
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>

-- 
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
