Yes, it is the correct IP and port:
ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
tcp        0      0 192.168.206.13:6804     0.0.0.0:*               LISTEN      22934/ceph-osd
I turned up the logging on the OSD and I don't think it received the request. However, I also noticed a large number of TCP connections from the client (192.168.206.17) to that specific OSD in the CLOSE_WAIT state (131, to be exact). I suspect there may be a bug causing the OSD not to close file descriptors. Prior to the hang I had been running tests continuously for several days, so the OSD process may have been accumulating open sockets.
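For anyone wanting to reproduce the check, here is a rough way to tally CLOSE_WAIT connections per remote peer from netstat output (a sketch; the field positions assume Linux netstat, where column 5 is the foreign address and column 6 the state, and `tally_close_wait` is just a name I made up for the helper):

```shell
# Tally CLOSE_WAIT connections per remote peer from `netstat -ant` output.
# Field positions assume Linux netstat: $5 = foreign address, $6 = state.
tally_close_wait() {
  awk '$6 == "CLOSE_WAIT" { split($5, a, ":"); count[a[1]]++ }
       END { for (ip in count) print ip, count[ip] }'
}

# usage: netstat -ant | tally_close_wait
```

Run on the OSD host, a single peer with a large count (like the 131 connections from 192.168.206.17 here) would point at one client leaking sockets rather than a general networking issue.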
I’m still gathering information, but based on that is there anything specific that would be helpful to find the problem?
Thanks, Phil
Just to cover all the bases, is 192.168.206.13:6804 really associated with a running daemon for OSD 11?

On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute <lacroute@xxxxxxxxxxxxxxxxxx> wrote:

Jason,
Thanks for the suggestion. That seems to show it is not the OSD that got stuck:
ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
…
2017-04-24 13:13:49.761076 7f739aefc700  1 -- 192.168.206.17:0/1250293899 --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0 con 0x7f737c0064e0
…
2017-04-24 13:14:04.756328 7f73a2880700  1 -- 192.168.206.17:0/1250293899 --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0 con 0x7f737c0064e0
ceph0:~$ sudo ceph pg map 1.af6f1e38
osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
ceph3:~$ sudo ceph daemon osd.11 ops
{ "ops": [], "num_ops": 0 }
I repeated this a few times and it's always the same command and the same placement group that hang, but osd.11 has no in-flight ops (and neither do osd.16 or osd.2, although I think that's expected since they're replicas).
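For completeness, the per-OSD ops check above can be looped over the whole acting set, with a small helper to pull num_ops out of the JSON (a sketch; `num_ops` is a hypothetical helper name, the grep-based parse is crude and jq would be cleaner if installed, and each `ceph daemon` call has to run on the host where that OSD lives):

```shell
# Extract "num_ops" from the JSON that `ceph daemon osd.N ops` prints,
# so a loop over the acting set can flag any OSD with in-flight ops.
# Crude grep-based parse; use jq if it is available.
num_ops() {
  grep -o '"num_ops": *[0-9]*' | grep -o '[0-9]*$'
}

# usage, run on each OSD's host (acting set [11,16,2] from `ceph pg map`):
#   for osd in 11 16 2; do
#     echo "osd.$osd: $(sudo ceph daemon osd.$osd ops | num_ops)"
#   done
```

If every OSD in the acting set reports zero in-flight ops while the client is blocked, that is consistent with the request never having reached the OSD, as the logs here suggest.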
Is there other tracing I should do on the OSD or something more to look at on the client?
Thanks, Phil
On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
2017-04-24 11:30:57.058233 7f5512ffd700 1 -- 192.168.206.17:0/3282647735 --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38 rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con 0x7f54f40064e0
You can attempt to run "ceph daemon osd.XYZ ops" against the potentially stuck OSD to figure out what it's stuck doing.
-- Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com