Yes, it is the correct IP and port:
ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
tcp        0      0 192.168.206.13:6804     0.0.0.0:*               LISTEN      22934/ceph-osd
I turned up the logging on the OSD and I don't think it received the request. However, I also noticed a large number of TCP connections from the client (192.168.206.17) to that specific OSD in the CLOSE_WAIT state (131, to be exact). I suspect there may be a bug causing the OSD not to close file descriptors. Prior to the hang I had been running tests continuously for several days, so the OSD process may have been accumulating open sockets.
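For anyone wanting to reproduce the check, here is a rough way to tally CLOSE_WAIT connections per remote peer from netstat output (a sketch; the field positions assume Linux netstat, where column 5 is the foreign address and column 6 the state, and `tally_close_wait` is just a name I made up for the helper):

```shell
# Tally CLOSE_WAIT connections per remote peer from `netstat -ant` output.
# Field positions assume Linux netstat: $5 = foreign address, $6 = state.
tally_close_wait() {
  awk '$6 == "CLOSE_WAIT" { split($5, a, ":"); count[a[1]]++ }
       END { for (ip in count) print ip, count[ip] }'
}

# usage: netstat -ant | tally_close_wait
```

Run on the OSD host, a single peer with a large count (like the 131 connections from 192.168.206.17 here) would point at one client leaking sockets rather than a general networking issue.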
I’m still gathering information, but based on that is there anything specific that would be helpful to find the problem?
Thanks, Phil
Just to cover all the bases, is 192.168.206.13:6804 really associated with a running daemon for OSD 11?

On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute <lacroute@xxxxxxxxxxxxxxxxxx> wrote:

Jason,
Thanks for the suggestion. That seems to show it is not the OSD that got stuck:
ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
…
2017-04-24 13:13:49.761076 7f739aefc700  1 -- 192.168.206.17:0/1250293899 --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0 con 0x7f737c0064e0
…
2017-04-24 13:14:04.756328 7f73a2880700  1 -- 192.168.206.17:0/1250293899 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0 con 0x7f737c0064e0
ceph0:~$ sudo ceph pg map 1.af6f1e38
osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
ceph3:~$ sudo ceph daemon osd.11 ops
{ "ops": [], "num_ops": 0 }
I repeated this a few times and it's always the same command and the same placement group that hang, but osd.11 has no in-flight ops (and neither do osd.16 or osd.2, although I think that's expected since they're replicas).
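For completeness, the per-OSD ops check above can be looped over the whole acting set, with a small helper to pull num_ops out of the JSON (a sketch; `num_ops` is a hypothetical helper name, the grep-based parse is crude and jq would be cleaner if installed, and each `ceph daemon` call has to run on the host where that OSD lives):

```shell
# Extract "num_ops" from the JSON that `ceph daemon osd.N ops` prints,
# so a loop over the acting set can flag any OSD with in-flight ops.
# Crude grep-based parse; use jq if it is available.
num_ops() {
  grep -o '"num_ops": *[0-9]*' | grep -o '[0-9]*$'
}

# usage, run on each OSD's host (acting set [11,16,2] from `ceph pg map`):
#   for osd in 11 16 2; do
#     echo "osd.$osd: $(sudo ceph daemon osd.$osd ops | num_ops)"
#   done
```

If every OSD in the acting set reports zero in-flight ops while the client is blocked, that is consistent with the request never having reached the OSD, as the logs here suggest.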
Is there other tracing I should do on the OSD or something more to look at on the client?
Thanks, Phil
On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
2017-04-24 11:30:57.058233 7f5512ffd700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con 0x7f54f40064e0
You can attempt to run "ceph daemon osd.XYZ ops" against the potentially stuck OSD to figure out what it's stuck doing.
-- Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com