I would double-check your file descriptor limits on both sides -- OSDs
and the client. 131 sockets shouldn't make a difference on their own.
Is the port open on any firewalls you have running? A few quick checks
are sketched at the bottom of this mail, below the quoted thread.

On Mon, Apr 24, 2017 at 8:14 PM, Phil Lacroute
<lacroute@xxxxxxxxxxxxxxxxxx> wrote:
> Yes it is the correct IP and port:
>
> ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
> tcp        0      0 192.168.206.13:6804    0.0.0.0:*    LISTEN    22934/ceph-osd
>
> I turned up the logging on the osd and I don’t think it received the
> request. However, I also noticed a large number of TCP connections to
> that specific osd from the client (192.168.206.17) in CLOSE_WAIT state
> (131 to be exact). I think there may be a bug causing the osd not to
> close file descriptors. Prior to the hang I had been running tests
> continuously for several days, so the osd process may have been
> accumulating open sockets.
>
> I’m still gathering information, but based on that, is there anything
> specific that would help find the problem?
>
> Thanks,
> Phil
>
> On Apr 24, 2017, at 5:01 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> Just to cover all the bases, is 192.168.206.13:6804 really associated
> with a running daemon for OSD 11?
>
> On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute
> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>
> Jason,
>
> Thanks for the suggestion. That seems to show it is not the OSD that
> got stuck:
>
> ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
> …
> 2017-04-24 13:13:49.761076 7f739aefc700  1 -- 192.168.206.17:0/1250293899
> --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38
> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix]
> snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0
> con 0x7f737c0064e0
> …
> 2017-04-24 13:14:04.756328 7f73a2880700  1 -- 192.168.206.17:0/1250293899
> --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0
> con 0x7f737c0064e0
>
> ceph0:~$ sudo ceph pg map 1.af6f1e38
> osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
>
> ceph3:~$ sudo ceph daemon osd.11 ops
> {
>     "ops": [],
>     "num_ops": 0
> }
>
> I repeated this a few times and it’s always the same command and the
> same placement group that hangs, but osd.11 has no ops (and neither do
> osd.16 and osd.2, although I think that’s expected).
>
> Is there other tracing I should do on the OSD, or something more to
> look at on the client?
>
> Thanks,
> Phil
>
> On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute
> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>
> 2017-04-24 11:30:57.058233 7f5512ffd700  1 -- 192.168.206.17:0/3282647735
> --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38
> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix]
> snapc 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0
> con 0x7f54f40064e0
>
> You can attempt to run "ceph daemon osd.XYZ ops" against the
> potentially stuck OSD to figure out what it's stuck doing.
>
> --
> Jason
>
> --
> Jason
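
To expand on the descriptor-limit suggestion above, here is roughly
what I would check first. Treat it as an untested sketch: 22934 is the
ceph-osd pid taken from your netstat output, and max_open_files is the
option name as I know it on my cluster; both may differ on yours.

# limit the daemon is actually running with
ceph3:~$ sudo grep 'open files' /proc/22934/limits
# limit Ceph tried to set at startup
ceph3:~$ sudo ceph daemon osd.11 config get max_open_files
# limit the rbd client inherits from its shell
ceph7:~$ ulimit -n

If the first number is small (1024 is a common default) and your test
workload opens many images or connections, that alone could explain a
stall once the descriptors run out.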
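To confirm whether the osd really is leaking descriptors, I would
count its open fds and watch the number while your tests run; if it
climbs steadily and never drops, that supports the leak theory. Same
hypothetical pid as above:

# one-off count of open descriptors
ceph3:~$ sudo ls /proc/22934/fd | wc -l
# re-count every 60 seconds while the test workload runs
ceph3:~$ sudo watch -n 60 'ls /proc/22934/fd | wc -l'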
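Finally, to see where the CLOSE_WAIT sockets come from, something like
this should tally them per remote peer (the column numbers assume the
usual "netstat -antp" layout, so they may need adjusting on your
distribution):

ceph3:~$ sudo netstat -antp | \
    awk '$6 == "CLOSE_WAIT" && $7 ~ /ceph-osd/ {print $5}' | \
    sort | uniq -c | sort -rn

If they all point back at 192.168.206.17, that suggests client-side
sessions are not being torn down cleanly rather than anything
firewall-related.

--
Jason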