I would double-check your file descriptor limits on both sides -- OSDs
and the client. 131 sockets shouldn't make a difference on their own.
Is the port open on any firewalls you have running? A few quick checks
are sketched at the bottom of this mail, below the quoted thread.

On Mon, Apr 24, 2017 at 8:14 PM, Phil Lacroute
<lacroute@xxxxxxxxxxxxxxxxxx> wrote:
> Yes it is the correct IP and port:
>
> ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
> tcp        0      0 192.168.206.13:6804    0.0.0.0:*    LISTEN    22934/ceph-osd
>
> I turned up the logging on the osd and I don’t think it received the
> request. However, I also noticed a large number of TCP connections to
> that specific osd from the client (192.168.206.17) in CLOSE_WAIT state
> (131 to be exact). I think there may be a bug causing the osd not to
> close file descriptors. Prior to the hang I had been running tests
> continuously for several days, so the osd process may have been
> accumulating open sockets.
>
> I’m still gathering information, but based on that, is there anything
> specific that would help find the problem?
>
> Thanks,
> Phil
>
> On Apr 24, 2017, at 5:01 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> Just to cover all the bases, is 192.168.206.13:6804 really associated
> with a running daemon for OSD 11?
>
> On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute
> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>
> Jason,
>
> Thanks for the suggestion. That seems to show it is not the OSD that
> got stuck:
>
> ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
> …
> 2017-04-24 13:13:49.761076 7f739aefc700  1 -- 192.168.206.17:0/1250293899
> --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38
> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix]
> snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0
> con 0x7f737c0064e0
> …
> 2017-04-24 13:14:04.756328 7f73a2880700  1 -- 192.168.206.17:0/1250293899
> --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0
> con 0x7f737c0064e0
>
> ceph0:~$ sudo ceph pg map 1.af6f1e38
> osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
>
> ceph3:~$ sudo ceph daemon osd.11 ops
> {
>     "ops": [],
>     "num_ops": 0
> }
>
> I repeated this a few times and it’s always the same command and the
> same placement group that hangs, but osd.11 has no ops (and neither do
> osd.16 and osd.2, although I think that’s expected).
>
> Is there other tracing I should do on the OSD, or something more to
> look at on the client?
>
> Thanks,
> Phil
>
> On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute
> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>
> 2017-04-24 11:30:57.058233 7f5512ffd700  1 -- 192.168.206.17:0/3282647735
> --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38
> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix]
> snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0
> con 0x7f54f40064e0
>
> You can attempt to run "ceph daemon osd.XYZ ops" against the
> potentially stuck OSD to figure out what it's stuck doing.
>
> --
> Jason
>
> --
> Jason
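
To expand on the descriptor-limit suggestion above, here is roughly
what I would check first. Treat it as an untested sketch: 22934 is the
ceph-osd pid taken from your netstat output, and max_open_files is the
option name as I know it on my cluster; both may differ on yours.

# limit the daemon is actually running with
ceph3:~$ sudo grep 'open files' /proc/22934/limits
# limit Ceph tried to set at startup
ceph3:~$ sudo ceph daemon osd.11 config get max_open_files
# limit the rbd client inherits from its shell
ceph7:~$ ulimit -n

If the first number is small (1024 is a common default) and your test
workload opens many images or connections, that alone could explain a
stall once the descriptors run out.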
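To confirm whether the osd really is leaking descriptors, I would
count its open fds and watch the number while your tests run; if it
climbs steadily and never drops, that supports the leak theory. Same
hypothetical pid as above:

# one-off count of open descriptors
ceph3:~$ sudo ls /proc/22934/fd | wc -l
# re-count every 60 seconds while the test workload runs
ceph3:~$ sudo watch -n 60 'ls /proc/22934/fd | wc -l'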
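Finally, to see where the CLOSE_WAIT sockets come from, something like
this should tally them per remote peer (the column numbers assume the
usual "netstat -antp" layout, so they may need adjusting on your
distribution):

ceph3:~$ sudo netstat -antp | \
    awk '$6 == "CLOSE_WAIT" && $7 ~ /ceph-osd/ {print $5}' | \
    sort | uniq -c | sort -rn

If they all point back at 192.168.206.17, that suggests client-side
sessions are not being torn down cleanly rather than anything
firewall-related.

--
Jason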