Re: iostat and dashboard freezing

Curious what distro you're running, as I've been having similar issues with mgr instability as well; wondering if there are any similar threads to pull at.

While the iostat command is running, is the active mgr using 100% CPU in top?

Reed

> On Aug 27, 2019, at 6:41 AM, Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> 
> Dear All,
> 
> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
> 
> Unfortunately "ceph iostat" spends most of it's time frozen, with
> occasional periods of working normally for less than a minute, then
> freeze again for a couple of minutes, then come back to life, and so so
> on...
> 
> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
> 
> [root@ceph-s3 ~]# ceph iostat
> ^CInterrupted
> Traceback (most recent call last):
>  File "/usr/bin/ceph", line 1263, in <module>
>    retval = main()
>  File "/usr/bin/ceph", line 1194, in main
>    verbose)
>  File "/usr/bin/ceph", line 619, in new_style_command
>    ret, outbuf, outs = do_command(parsed_args, target, cmdargs, sigdict, inbuf, verbose)
>  File "/usr/bin/ceph", line 593, in do_command
>    return ret, '', ''
> UnboundLocalError: local variable 'ret' referenced before assignment
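> 
> As far as I can tell this traceback is just a symptom of the CTRL+C,
> not of the freeze itself: if the interrupt arrives before the command
> completes, 'ret' is never assigned before do_command returns it. A
> minimal sketch of that failure pattern (simplified, names are mine,
> not the actual ceph CLI code):
> 
>     def do_command_sketch(run_command):
>         try:
>             ret = run_command()   # CTRL+C can land before this assigns
>         except KeyboardInterrupt:
>             print('Interrupted')
>         return ret, '', ''        # UnboundLocalError if 'ret' was never set
> 
> So the traceback is probably just fallout from the interrupt rather
> than the underlying problem.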
> 
> Observations:
> 
> 1) This problem does not seem to be related to load on the cluster.
> 
> 2) When iostat is stalled, the dashboard is also unresponsive; when
> iostat is working, the dashboard also works.
> 
> Presumably the iostat and dashboard problems are due to the same
> underlying fault? Perhaps a problem with the mgr?
> 
> 
> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> shows:
> 
> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
> false}]: dispatch
> 
> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> 
> 5) When the dashboard is not working, the mgr log sometimes shows:
> 
> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
> [::ffff:10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> /api/health/minimal
> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
> Internal Server Error", "version": "3.2.2", "detail": "The server
> encountered an unexpected condition which prevented it from fulfilling
> the request.", "traceback": "Traceback (most recent call last):\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
> in respond\\n    response.body = self.handler()\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> 188, in __call__\\n    self.body = self.oldhandler(*args, **kwargs)\\n
> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> 221, in wrap\\n    return self.newhandler(innerfunc, *args, **kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> 88, in dashboard_exception_handler\\n    return handler(*args,
> **kwargs)\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
> in __call__\\n    return self.callable(*self.args, **self.kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> 649, in inner\\n    ret = func(*args, **kwargs)\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> minimal\\n    return self.health_minimal.all_health()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> all_health\\n    result[\'pools\'] = self.pools()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
> pools\\n    pools = CephService.get_pool_list_with_stats()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
> in get_pool_list_with_stats\\n    \'series\': [i for i in
> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
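> 
> That "deque mutated during iteration" fits the list comprehension in
> get_pool_list_with_stats: if another mgr thread appends to the stats
> deque while the dashboard thread is iterating it, CPython raises
> exactly this RuntimeError. A small standalone reproduction (my own
> sketch, not the dashboard code):
> 
>     import collections
>     import threading
> 
>     stats = collections.deque(maxlen=1000)
> 
>     def writer():
>         for i in range(1000000):
>             stats.append(i)               # mutates the deque concurrently
> 
>     t = threading.Thread(target=writer)
>     t.start()
>     try:
>         while t.is_alive():
>             series = [i for i in stats]   # usually trips over a mutation
>         print('no race hit on this run')
>     except RuntimeError as e:
>         print(e)                          # "deque mutated during iteration"
> 
> If that is what is happening, the fix would presumably be a lock or a
> snapshot around the stats deque on the mgr side, but that's one for
> the dashboard developers.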
> 
> 
> 6) IPV6 is normally disabled on our machines at the kernel level, via
> grubby --update-kernel=ALL --args="ipv6.disable=1"
> 
> Unfortunately, disabling ipv6 interfered with the dashboard, giving
> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
> created',)", so we re-enabled ipv6 on the mgr nodes only to fix this.
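> 
> My assumption (not checked against the mgr source) is that the
> dashboard binds to the IPv6 wildcard address by default, which would
> explain both the "No socket could be created" error when ipv6 is
> disabled and the IPv4-mapped addresses like ::ffff:10.91.192.36 in
> the log above. A quick way to test that theory on a node:
> 
>     import socket
> 
>     try:
>         s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
>         s.bind(('::', 0))    # port 0 = any free port, just testing the bind
>         print('IPv6 wildcard bind OK')
>         s.close()
>     except (socket.error, OSError) as e:
>         print('IPv6 bind failed: %s' % e)
> 
> With ipv6.disable=1 I'd expect the socket() call itself to fail with
> "Address family not supported by protocol".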
> 
> 
> Ideas...?
> 
> Should ipv6 be enabled, even if not configured, on all ceph nodes?
> 
> Any ideas on fixing this gratefully received!
> 
> many thanks
> 
> Jake
> 
> -- 
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
