A quick update just to close out this thread:

After investigating with netstat I found that one ceph-osd process had three TCP connections in ESTABLISHED state but with no connection state on the peer system (the client node that had previously been using the RBD image). The qemu process on the client had terminated and all connection state had been cleaned up. On the osd node, two of these TCP connections had data in their send queues, and the retransmit timer had reached zero, but for some reason the retransmissions were not happening (confirmed with tcpdump) and the connections were not timing out. The osd node remained in this state for over 24 hours. At this point I’m unable to explain why TCP did not time out the connections, given that the peer had closed them. This is a Debian jessie system with a stock 4.9.0 kernel, so there is nothing non-standard about the networking stack.

The ceph-osd process seems to have gotten stuck because of this. There were no active operations (according to "ceph daemon OSD ops") but perhaps that is expected once data is already in the send queue. In addition to the three connections in ESTABLISHED state, there were over 100 connections in CLOSE_WAIT state, which indicates that the osd was holding these descriptors open even though the TCP connections had terminated, so the reaping thread was perhaps blocked waiting for the pending I/O to finish. Also, the osd would not accept any new requests associated with the same RBD image.

I’m not sure whether there is any problem in the ceph code, given the misbehaving TCP connections. Better error handling to prevent getting stuck might be appropriate, but I can’t say until I understand what caused the TCP problem.

Finally, the only thing slightly non-standard about our test environment is that we have IPsec enabled, but that should be independent of the TCP layer. There are no firewalls and ping was working fine. The periodic IKE traffic for IPsec renegotiation was also working (observed with tcpdump).
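For anyone who wants to repeat the netstat check described above, here is a rough sketch in Python of the kind of summary I mean: count the TCP states for one local endpoint and list any ESTABLISHED sockets that still have data in the send queue. It assumes the usual Linux "netstat -ant" column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State). The function name and the sample lines are made up for illustration; the client port numbers and queue sizes are not taken from the real system.

```python
from collections import Counter

# Made-up sample in "netstat -ant" layout; not captured from the real OSD node.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.168.206.13:6804     0.0.0.0:*               LISTEN
tcp        0   4096 192.168.206.13:6804     192.168.206.17:40001    ESTABLISHED
tcp        0      0 192.168.206.13:6804     192.168.206.17:40002    ESTABLISHED
tcp        1      0 192.168.206.13:6804     192.168.206.17:40003    CLOSE_WAIT
tcp        1      0 192.168.206.13:6804     192.168.206.17:40004    CLOSE_WAIT
"""


def summarize_netstat(text, local_endpoint):
    """Return (state counts, ESTABLISHED sockets with queued send data)."""
    states = Counter()
    queued = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        send_q, local, peer, state = fields[2], fields[3], fields[4], fields[5]
        if local != local_endpoint:
            continue
        states[state] += 1
        # A non-empty Send-Q on an ESTABLISHED socket means unacknowledged
        # or unsent data is still waiting in the kernel.
        if state == "ESTABLISHED" and int(send_q) > 0:
            queued.append((peer, int(send_q)))
    return states, queued


if __name__ == "__main__":
    states, queued = summarize_netstat(SAMPLE, "192.168.206.13:6804")
    print(dict(states))
    print(queued)
```

Running this periodically against real netstat output would show whether the CLOSE_WAIT count keeps growing and whether the same sockets keep a non-empty send queue, which is what I saw here.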
I will be rerunning the same tests, and if I can reproduce this and make more progress on the cause I’ll report back.

Thanks,
Phil

> On Apr 24, 2017, at 5:16 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>
> I would double-check your file descriptor limits on both sides -- OSDs
> and the client. 131 sockets shouldn't make a difference. Port is open
> on any possible firewalls you have running?
>
> On Mon, Apr 24, 2017 at 8:14 PM, Phil Lacroute
> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>> Yes it is the correct IP and port:
>>
>> ceph3:~$ netstat -anp | fgrep 192.168.206.13:6804
>> tcp 0 0 192.168.206.13:6804 0.0.0.0:* LISTEN
>> 22934/ceph-osd
>>
>> I turned up the logging on the osd and I don’t think it received the
>> request. However I also noticed a large number of TCP connections to that
>> specific osd from the client (192.168.206.17) in CLOSE_WAIT state (131 to be
>> exact). I think there may be a bug causing the osd not to close file
>> descriptors. Prior to the hang I had been running tests continuously for
>> several days so the osd process may have been accumulating open sockets.
>>
>> I’m still gathering information, but based on that is there anything
>> specific that would be helpful to find the problem?
>>
>> Thanks,
>> Phil
>>
>> On Apr 24, 2017, at 5:01 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>
>> Just to cover all the bases, is 192.168.206.13:6804 really associated
>> with a running daemon for OSD 11?
>>
>> On Mon, Apr 24, 2017 at 4:23 PM, Phil Lacroute
>> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Jason,
>>
>> Thanks for the suggestion. That seems to show it is not the OSD that got
>> stuck:
>>
>> ceph7:~$ sudo rbd -c debug/ceph.conf info app/image1
>> …
>> 2017-04-24 13:13:49.761076 7f739aefc700 1 -- 192.168.206.17:0/1250293899
>> --> 192.168.206.13:6804/22934 -- osd_op(client.4384.0:3 1.af6f1e38
>> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc
>> 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f737c0077f0 con
>> 0x7f737c0064e0
>> …
>> 2017-04-24 13:14:04.756328 7f73a2880700 1 -- 192.168.206.17:0/1250293899
>> --> 192.168.206.13:6804/22934 -- ping magic: 0 v1 -- ?+0 0x7f7374000fc0 con
>> 0x7f737c0064e0
>>
>> ceph0:~$ sudo ceph pg map 1.af6f1e38
>> osdmap e27 pg 1.af6f1e38 (1.38) -> up [11,16,2] acting [11,16,2]
>>
>> ceph3:~$ sudo ceph daemon osd.11 ops
>> {
>> "ops": [],
>> "num_ops": 0
>> }
>>
>> I repeated this a few times and it’s always the same command and same
>> placement group that hangs, but OSD11 has no ops (and neither do OSD16 and
>> OSD2, although I think that’s expected).
>>
>> Is there other tracing I should do on the OSD or something more to look at
>> on the client?
>>
>> Thanks,
>> Phil
>>
>> On Apr 24, 2017, at 12:39 PM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>
>> On Mon, Apr 24, 2017 at 2:53 PM, Phil Lacroute
>> <lacroute@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> 2017-04-24 11:30:57.058233 7f5512ffd700 1 -- 192.168.206.17:0/3282647735
>> --> 192.168.206.13:6804/22934 -- osd_op(client.4375.0:3 1.af6f1e38
>> rbd_header.1058238e1f29 [call rbd.get_size,call rbd.get_object_prefix] snapc
>> 0=[] ack+read+known_if_redirected e27) v7 -- ?+0 0x7f54f40077f0 con
>> 0x7f54f40064e0
>>
>> You can attempt to run "ceph daemon osd.XYZ ops" against the
>> potentially stuck OSD to figure out what it's stuck doing.
>>
>> --
>> Jason
>>
>> --
>> Jason
>
> --
> Jason
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com