Re: Several ceph osd commands hang

I would look into a potential network problem.  Check for errors on both the server side and on the switch side.
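
Bryan's suggestion to check the server side for errors can start with the kernel's per-interface counters. This is a minimal sketch that assumes a Linux host exposing the standard sysfs files under /sys/class/net/<iface>/statistics/. The function name nic_error_counters is hypothetical, and the interface name in the usage line is a placeholder. Switch-side counters have to be read on the switch itself and are not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the non-zero error/drop counters of one
# network interface, read from the Linux sysfs statistics files.
# Usage: nic_error_counters <iface> [sysfs-net-root]
# The optional second argument (default /sys/class/net) only exists so the
# function can be pointed at a copy of the tree for testing.
nic_error_counters() {
    local iface="$1"
    local root="${2:-/sys/class/net}"
    local dir="$root/$iface/statistics"
    local counter value

    if [ ! -d "$dir" ]; then
        echo "no statistics directory for '$iface' under $root" >&2
        return 1
    fi

    for counter in rx_errors tx_errors rx_dropped tx_dropped \
                   rx_crc_errors rx_frame_errors rx_missed_errors; do
        # Not every driver populates every counter; skip missing files.
        [ -r "$dir/$counter" ] || continue
        value=$(cat "$dir/$counter")
        if [ "$value" -ne 0 ]; then
            echo "$iface $counter=$value"
        fi
    done
}
```

Run it on each Ceph node, for example `nic_error_counters eth0`. The counters accumulate from the moment the interface came up, so take two readings a few minutes apart and look for values that keep growing. `ip -s link show dev eth0` reports the same numbers interactively.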

Otherwise I'm not really sure what's going on.  Someone else will have to jump into the conversation.

Bryan

On Oct 29, 2019, at 10:38 AM, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
> 
> Thanks.
> 
> 2 of 4 MGR nodes are unhealthy.
> I have stopped the MGR services on both nodes.
> 
> When I start the service again on node A, I get this in its log:
> root@ld5508:~# tail -f /var/log/ceph/ceph-mgr.ld5508.log
> 2019-10-29 17:32:02.024 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.028 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.032 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.040 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.044 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.048 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.052 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.060 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
> 2019-10-29 17:32:02.064 7fe209fe8700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
> 2019-10-29 17:32:02.064 7fe209fe8700 -1 mgr handle_signal *** Got signal Terminated ***
> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 set uid:gid to 64045:64045 (ceph:ceph)
> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable), process ceph-mgr, pid 250399
> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 pidfile_write: ignore empty --pid-file
> 2019-10-29 17:37:54.331 7f0e26fc1dc0  1 mgr[py] Loading python module 'ansible'
> 2019-10-29 17:37:54.503 7f0e26fc1dc0  1 mgr[py] Loading python module 'balancer'
> 2019-10-29 17:37:54.531 7f0e26fc1dc0  1 mgr[py] Loading python module 'crash'
> 2019-10-29 17:37:54.551 7f0e26fc1dc0  1 mgr[py] Loading python module 'dashboard'
> 2019-10-29 17:37:54.915 7f0e26fc1dc0  1 mgr[py] Loading python module 'deepsea'
> 2019-10-29 17:37:55.071 7f0e26fc1dc0  1 mgr[py] Loading python module 'devicehealth'
> 2019-10-29 17:37:55.103 7f0e26fc1dc0  1 mgr[py] Loading python module 'influx'
> 2019-10-29 17:37:55.127 7f0e26fc1dc0  1 mgr[py] Loading python module 'insights'
> 2019-10-29 17:37:55.207 7f0e26fc1dc0  1 mgr[py] Loading python module 'iostat'
> 2019-10-29 17:37:55.227 7f0e26fc1dc0  1 mgr[py] Loading python module 'localpool'
> 2019-10-29 17:37:55.247 7f0e26fc1dc0  1 mgr[py] Loading python module 'orchestrator_cli'
> 2019-10-29 17:37:55.295 7f0e26fc1dc0  1 mgr[py] Loading python module 'pg_autoscaler'
> 2019-10-29 17:37:55.347 7f0e26fc1dc0  1 mgr[py] Loading python module 'progress'
> 2019-10-29 17:37:55.387 7f0e26fc1dc0  1 mgr[py] Loading python module 'prometheus'
> 2019-10-29 17:37:55.599 7f0e26fc1dc0  1 mgr[py] Loading python module 'rbd_support'
> 2019-10-29 17:37:55.647 7f0e26fc1dc0  1 mgr[py] Loading python module 'restful'
> 2019-10-29 17:37:55.959 7f0e26fc1dc0  1 mgr[py] Loading python module 'selftest'
> 2019-10-29 17:37:55.983 7f0e26fc1dc0  1 mgr[py] Loading python module 'status'
> 2019-10-29 17:37:56.015 7f0e26fc1dc0  1 mgr[py] Loading python module 'telegraf'
> 2019-10-29 17:37:56.051 7f0e26fc1dc0  1 mgr[py] Loading python module 'telemetry'
> 2019-10-29 17:37:56.331 7f0e26fc1dc0  1 mgr[py] Loading python module 'test_orchestrator'
> 2019-10-29 17:37:56.399 7f0e26fc1dc0  1 mgr[py] Loading python module 'volumes'
> 2019-10-29 17:37:56.459 7f0e26fc1dc0  1 mgr[py] Loading python module 'zabbix'
> 2019-10-29 17:37:56.503 7f0e21cdd700  1 mgr load Constructed class from module: dashboard
> 2019-10-29 17:37:56.503 7f0e214dc700  0 ms_deliver_dispatch: unhandled message 0x56346f978400 mon_map magic: 0 v1 from mon.0 v2:10.97.206.93:3300/0
> 2019-10-29 17:37:56.507 7f0e214dc700  0 client.0 ms_handle_reset on v2:10.97.206.93:6912/22258
> 2019-10-29 17:37:56.743 7f0e16363700  0 mgr[dashboard] [29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
>     self.tick()
>   File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
>     s, ssl_env = self.ssl_adapter.wrap(s)
>   File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
>     server_side=True)
>   File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
>     _context=self)
>   File "/usr/lib/python2.7/ssl.py", line 599, in __init__
>     self.do_handshake()
>   File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
>     self._sslobj.do_handshake()
> error: [Errno 0] Error
> 
> ^C
> 
> This looks like a severe issue.
> 
> 
> On 29.10.2019 at 17:22, Bryan Stillwell wrote:
>> On Oct 29, 2019, at 9:44 AM, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
> >> In my unhealthy cluster I cannot run several ceph osd commands because
> >> they hang, e.g.:
>>> ceph osd df
>>> ceph osd pg dump
>>> 
>>> Also, ceph balancer status hangs.
>>> 
>>> How can I fix this issue?
>> Check the status of your ceph-mgr processes (restart them if needed and check the logs for more details).  Those are responsible for handling those commands in recent releases.
>> 
>> Bryan
> 
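
Bryan's first reply points at the ceph-mgr daemons. According to him, in recent releases they answer commands such as `ceph osd df` and `ceph balancer status`, so those commands can block when no mgr is healthy. The minimal sketch below defines two hypothetical helper functions for that triage; neither is part of Ceph. run_with_deadline wraps a command in coreutils `timeout` so a hang is reported instead of blocking the shell. badauth_by_peer counts BADAUTHORIZER lines in a mgr log per peer address. The parser assumes each log entry occupies one line on disk and that the peer address is the token right after `>>`, as in Thomas's excerpt.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers; not part of Ceph.

# Run a command with a deadline. coreutils `timeout` exits with 124 when
# the deadline is hit; in that case print a note to stderr. The command's
# own exit status is passed through otherwise.
run_with_deadline() {
    local seconds="$1"
    shift
    local rc=0
    timeout "$seconds" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG after ${seconds}s: $*" >&2
    fi
    return "$rc"
}

# Count BADAUTHORIZER lines per peer address in a ceph-mgr log (file
# arguments, or stdin if none). Assumes the peer follows the ">>" token:
#   ... --1- <local addr> >> v1:<peer addr> conn(...) ... connect got BADAUTHORIZER
# Output: "<count> <peer>", highest count first.
badauth_by_peer() {
    awk '/BADAUTHORIZER/ {
             for (i = 1; i < NF; i++)
                 if ($i == ">>") { count[$(i + 1)]++; break }
         }
         END { for (peer in count) print count[peer], peer }' "$@" |
        sort -k1,1nr -k2,2
}
```

For example, `run_with_deadline 30 ceph osd df` shows whether the command hangs, and `badauth_by_peer /var/log/ceph/ceph-mgr.ld5508.log` shows whether the rejections come from one peer or many. A command that itself exits with status 124 cannot be told apart from a timeout here. `ceph mgr dump` reports which mgr is active, and `systemctl status ceph-mgr@ld5508` shows the daemon's state, assuming the usual unit naming with the host name as mgr id.
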
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


