Re: Several ceph osd commands hang

I have checked the network already.

There is no indication of a network problem: there are no dropped packets,
and a load test with iperf shows good performance.
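
For reference, the check was roughly along these lines (eth0 is only a
placeholder for the actual cluster-network interface, and the iperf target
is simply one of the node addresses mentioned below):

ip -s link show eth0                   # RX/TX error and dropped counters
ethtool -S eth0 | grep -iE 'err|drop'  # NIC-level error statistics
iperf -s                               # on one node
iperf -c 10.97.206.96 -t 30            # from another node, 30s throughput test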



On 29.10.2019 at 17:44, Bryan Stillwell wrote:
> I would look into a potential network problem.  Check for errors on both the server side and on the switch side.
>
> Otherwise I'm not really sure what's going on.  Someone else will have to jump into the conversation.
>
> Bryan
>
> On Oct 29, 2019, at 10:38 AM, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>
>> Thanks.
>>
>> 2 of 4 MGR nodes are sick.
>> I have stopped MGR services on both nodes.
>>
>> When I start the service again on node A, I get this in its log:
>> root@ld5508:~# tail -f /var/log/ceph/ceph-mgr.ld5508.log
>> 2019-10-29 17:32:02.024 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.028 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.032 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.040 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.044 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.048 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.052 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x564582ad5180 0x5645991ca800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.060 7fe20e881700  0 --1- 10.97.206.96:0/201758478 >>
>> v1:10.97.206.96:7055/17961 conn(0x5645977b5180 0x564582b2e800 :-1
>> s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
>> connect got BADAUTHORIZER
>> 2019-10-29 17:32:02.064 7fe209fe8700 -1 received  signal: Terminated
>> from /sbin/init  (PID: 1) UID: 0
>> 2019-10-29 17:32:02.064 7fe209fe8700 -1 mgr handle_signal *** Got signal
>> Terminated ***
>> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 set uid:gid to 64045:64045
>> (ceph:ceph)
>> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 ceph version 14.2.4
>> (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus (stable), process
>> ceph-mgr, pid 250399
>> 2019-10-29 17:37:54.319 7f0e26fc1dc0  0 pidfile_write: ignore empty
>> --pid-file
>> 2019-10-29 17:37:54.331 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'ansible'
>> 2019-10-29 17:37:54.503 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'balancer'
>> 2019-10-29 17:37:54.531 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'crash'
>> 2019-10-29 17:37:54.551 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'dashboard'
>> 2019-10-29 17:37:54.915 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'deepsea'
>> 2019-10-29 17:37:55.071 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'devicehealth'
>> 2019-10-29 17:37:55.103 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'influx'
>> 2019-10-29 17:37:55.127 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'insights'
>> 2019-10-29 17:37:55.207 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'iostat'
>> 2019-10-29 17:37:55.227 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'localpool'
>> 2019-10-29 17:37:55.247 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'orchestrator_cli'
>> 2019-10-29 17:37:55.295 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'pg_autoscaler'
>> 2019-10-29 17:37:55.347 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'progress'
>> 2019-10-29 17:37:55.387 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'prometheus'
>> 2019-10-29 17:37:55.599 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'rbd_support'
>> 2019-10-29 17:37:55.647 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'restful'
>> 2019-10-29 17:37:55.959 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'selftest'
>> 2019-10-29 17:37:55.983 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'status'
>> 2019-10-29 17:37:56.015 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'telegraf'
>> 2019-10-29 17:37:56.051 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'telemetry'
>> 2019-10-29 17:37:56.331 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'test_orchestrator'
>> 2019-10-29 17:37:56.399 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'volumes'
>> 2019-10-29 17:37:56.459 7f0e26fc1dc0  1 mgr[py] Loading python module
>> 'zabbix'
>> 2019-10-29 17:37:56.503 7f0e21cdd700  1 mgr load Constructed class from
>> module: dashboard
>> 2019-10-29 17:37:56.503 7f0e214dc700  0 ms_deliver_dispatch: unhandled
>> message 0x56346f978400 mon_map magic: 0 v1 from mon.0 v2:10.97.206.93:3300/0
>> 2019-10-29 17:37:56.507 7f0e214dc700  0 client.0 ms_handle_reset on
>> v2:10.97.206.93:6912/22258
>> 2019-10-29 17:37:56.743 7f0e16363700  0 mgr[dashboard]
>> [29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
>> Traceback (most recent call last):
>>  File
>> "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line
>> 2021, in start
>>    self.tick()
>>  File
>> "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line
>> 2090, in tick
>>    s, ssl_env = self.ssl_adapter.wrap(s)
>>  File
>> "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py",
>> line 67, in wrap
>>    server_side=True)
>>  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
>>    _context=self)
>>  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
>>    self.do_handshake()
>>  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
>>    self._sslobj.do_handshake()
>> error: [Errno 0] Error
>>
>> ^C
>>
>> This looks like a severe issue.
>>
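A side note on the BADAUTHORIZER messages above: as far as I understand,
they usually point to an authentication problem rather than a network one,
e.g. the mgr keyring on disk no longer matching the key registered with the
monitors, or clock skew between the nodes. This is what I intend to compare
next (assuming the default keyring location):

ceph auth get mgr.ld5508                     # key the monitors expect
cat /var/lib/ceph/mgr/ceph-ld5508/keyring    # key the mgr actually uses
timedatectl status                           # check NTP sync on each node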
>>
>> On 29.10.2019 at 17:22, Bryan Stillwell wrote:
>>> On Oct 29, 2019, at 9:44 AM, Thomas Schneider <74cmonty@xxxxxxxxx> wrote:
>>>> In my unhealthy cluster I cannot run several ceph osd commands because
>>>> they hang, e.g.
>>>> ceph osd df
>>>> ceph osd pg dump
>>>>
>>>> Also, ceph balancer status hangs.
>>>>
>>>> How can I fix this issue?
>>> Check the status of your ceph-mgr processes (restart them if needed and check the logs for more details).  Those are responsible for handling those commands in recent releases.
>>>
>>> Bryan
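As Bryan suggests above, this is roughly how I check and restart the manager
daemons on each node (the mgr id is the hostname, e.g. ld5508):

systemctl status ceph-mgr@ld5508     # is the daemon running, recent log lines
systemctl restart ceph-mgr@ld5508    # restart it if it is stuck
ceph -s                              # shows the active mgr and the standbys
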
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


