there was a problem on one of the clusters i manage a couple of weeks ago where pairs of OSDs would wait indefinitely on subops from the other OSD in the pair. we used a liberal dose of "ceph osd down ##" on the affected osds and things eventually sorted themselves out a couple of days later.
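for reference, the workaround amounted to repeating something along these lines against whichever pair was blocked at the time (41 and 148 just happen to be today's pair). marking an osd down only forces it to re-peer; the daemon keeps running and marks itself back up.

ceph osd down 41
ceph osd down 148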
it seems to have come back today, and my co-workers and i are stuck trying to figure out why this is happening. here are the details that i know. currently two OSDs, 41 and 148, keep waiting on subops from each other, resulting in lines like the following in ceph.log:
2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster [WRN] slow request 30.642736 seconds old, received at 2016-04-28 13:28:56.233001: osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently waiting for subops from 5,140,148
2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster [WRN] slow request 30.960922 seconds old, received at 2016-04-28 13:28:57.070471: osd_op(client.24127500.0:2960618 rbd_data.38178d8adeb4d.00000000000010f8 [set-alloc-hint object_size 8388608 write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently waiting for subops from 41,115
from digging in the logs, it appears that some messages are being lost between the OSDs. this is what osd.41 sees:
---------------------------------------------
2016-04-28 13:28:56.233702 7f3b171e0700 1 -- 10.217.72.22:6802/3769 <== client.11172360 10.217.72.41:0/6031968 6 ==== osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected e159001) v5 ==== 236+0+143360 (781016428 0 3953649960) 0x1d551c00 con 0x1a78d9c0
2016-04-28 13:28:56.233983 7f3b49020700 1 -- 10.217.89.22:6825/313003769 --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799) v1 -- ?+46 0x1d6db200 con 0x21add440
2016-04-28 13:28:56.234017 7f3b49020700 1 -- 10.217.89.22:6825/313003769 --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799) v1 -- ?+46 0x1d6dd000 con 0x21ada000
2016-04-28 13:28:56.234046 7f3b49020700 1 -- 10.217.89.22:6825/313003769 --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
2016-04-28 13:28:56.243555 7f3b35976700 1 -- 10.217.89.22:6825/313003769 <== osd.140 10.217.89.11:6810/4543 23 ==== osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1 ==== 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
2016-04-28 13:28:56.257816 7f3b27d9b700 1 -- 10.217.89.22:6825/313003769 <== osd.5 10.217.89.18:6806/1010441 35 ==== osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1 ==== 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
this, however, is what osd.148 sees:
---------------------------------------------
[ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
2016-04-28 13:29:33.470156 7f195fcfc700 1 -- 10.217.72.11:6820/6487 <== client.11172360 10.217.72.41:0/6031968 460 ==== osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected e159002) v5 ==== 236+0+143360 (129493315 0 3953649960) 0x1edf2300 con 0x24dc0d60
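for completeness, the osd.41 excerpt above was pulled with the same sort of grep on the host carrying osd.41:

grep :516946146 /var/log/ceph/ceph-osd.41.log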
also, due to the ceph osd down commands, there is recovery that needs to happen for a PG shared between these OSDs, and it is never making any progress. it's probably stalled by whatever is causing the repops to fail.
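these are roughly the commands we've been using to keep an eye on the stuck recovery; dump_stuck lists the pgs that haven't gone active+clean, and pg query gives the per-osd peering/recovery state (<pgid> here is a placeholder for the pg shared by 41 and 148):

ceph health detail
ceph pg dump_stuck unclean
ceph pg <pgid> query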
i ran tcpdump on both sides, limited to the ip addresses and ports being used by these two OSDs, and i do see packets flowing between the two osds. i attempted to have wireshark decode the actual ceph traffic, but it was only able to pick out bits and pieces of the ceph protocol; at least for the moment i'm blaming that on the ceph dissector for wireshark. there aren't any dropped or error packets on any of the network interfaces involved.
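the captures were along these lines on each host (interface name is a placeholder; the addresses and ports are the osd-to-osd ones from the messenger lines above):

tcpdump -nn -i <iface> -w osd41-osd148.pcap 'host 10.217.89.22 and host 10.217.89.11 and (port 6825 or port 6812)'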
does anyone have any ideas of where to look next or other tips for this? we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned above. putting them at 20 probably isn't going to be helpful, so does anyone have a suggestion for another level that might be more useful? go figure that this would happen while i'm at the openstack summit, keeping me from paying attention to some interesting presentations.
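for reference, we've been bumping the levels on the fly with injectargs, so they can be changed again on just these two osds without restarting anything (the values shown are just what's currently set):

ceph tell osd.41 injectargs '--debug_ms 1/1 --debug_osd 1/1'
ceph tell osd.148 injectargs '--debug_ms 1/1 --debug_osd 1/1'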
thanks in advance for any help.
mike