I'd guess that to make any progress we'll need debug ms = 20 on both sides
of the connection when a message is lost.
-Sam

On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
> there was a problem on one of the clusters i manage a couple of weeks ago
> where pairs of OSDs would wait indefinitely on subops from the other OSD in
> the pair. we used a liberal dose of "ceph osd down ##" on the OSDs and
> eventually things sorted themselves out a couple of days later.
>
> it seems to have come back today, and my co-workers and i are stuck trying
> to figure out why this is happening. here are the details that i know:
> currently two OSDs, 41 and 148, keep waiting on subops from each other,
> resulting in lines such as the following in ceph.log.
>
> 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> 13:28:56.233001: osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304
> write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 5,140,148
>
> 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
> [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> 13:28:57.070471: osd_op(client.24127500.0:2960618
> rbd_data.38178d8adeb4d.00000000000010f8 [set-alloc-hint object_size 8388608
> write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 41,115
>
> from digging in the logs, it appears that some messages are being lost
> between the OSDs. this is what osd.41 sees:
> ---------------------------------------------
> 2016-04-28 13:28:56.233702 7f3b171e0700 1 -- 10.217.72.22:6802/3769 <==
> client.11172360 10.217.72.41:0/6031968 6 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> e159001) v5 ==== 236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> 0x1a78d9c0
> 2016-04-28 13:28:56.233983 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6db200 con 0x21add440
> 2016-04-28 13:28:56.234017 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6dd000 con 0x21ada000
> 2016-04-28 13:28:56.234046 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
> 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v
> 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
> 2016-04-28 13:28:56.243555 7f3b35976700 1 -- 10.217.89.22:6825/313003769
> <== osd.140 10.217.89.11:6810/4543 23 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
> 2016-04-28 13:28:56.257816 7f3b27d9b700 1 -- 10.217.89.22:6825/313003769
> <== osd.5 10.217.89.18:6806/1010441 35 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
>
> this, however, is what osd.148 sees:
> ---------------------------------------------
> [ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
> 2016-04-28 13:29:33.470156 7f195fcfc700 1 -- 10.217.72.11:6820/6487 <==
> client.11172360 10.217.72.41:0/6031968 460 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected
> e159002) v5 ==== 236+0+143360 (129493315 0 3953649960) 0x1edf2300 con
> 0x24dc0d60
>
> also, due to the "ceph osd down" commands, there is recovery that needs to
> happen for a PG shared between these OSDs, and it is never making any
> progress. it's probably due to whatever is causing the repops to fail.
>
> i did some tcpdump on both sides, limiting things to the ip addresses and
> ports being used by these two OSDs, and i can see packets flowing between
> them. i attempted to have wireshark decode the actual ceph traffic, but it
> was only able to recover bits and pieces of the ceph protocol; for the
> moment i'm blaming that on the ceph dissector for wireshark. there aren't
> any dropped or errored packets on any of the network interfaces involved.
>
> does anyone have any ideas on where to look next, or other tips for this?
> we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned
> above. putting them at 20 probably isn't going to be helpful, so does
> anyone have a suggestion for another level that might be useful? go figure
> that this would happen while i'm at the openstack summit, keeping me from
> paying attention to some interesting presentations.
>
> thanks in advance for any help.
>
> mike

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
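
A minimal sketch of applying the debug ms = 20 suggestion above, assuming the
stock ceph CLI of that era (tell/injectargs from an admin node, or the admin
socket on the host that runs each OSD); the OSD ids are the two from this
thread, everything else is generic:

    # raise messenger debugging on both ends of the stuck connection
    ceph tell osd.41  injectargs '--debug_ms 20'
    ceph tell osd.148 injectargs '--debug_ms 20'

    # equivalent, via the admin socket on each OSD's own host
    ceph daemon osd.41  config set debug_ms 20
    ceph daemon osd.148 config set debug_ms 20

    # drop it back down once a stuck op has been captured; level 20 logs grow fast
    ceph tell osd.41  injectargs '--debug_ms 1'
    ceph tell osd.148 injectargs '--debug_ms 1'

With both sides at 20, the send of the osd_repop on osd.41 and the matching
receive (or its absence) on osd.148 should both show up in the messenger log,
which is what the thread is trying to correlate.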
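
And a rough sketch of the kind of capture Mike describes, assuming the
cluster-network endpoints that appear in the osd.41 log above
(10.217.89.22:6825 for osd.41, 10.217.89.11:6812 for what looks like
osd.148's side); the interface name and output file are placeholders:

    # on either host: capture only the traffic between the two OSD endpoints
    tcpdump -i eth0 -s 0 -w osd41-osd148.pcap \
        'host 10.217.89.22 and host 10.217.89.11 and (port 6825 or port 6812)'

    # afterwards, a quick look for resets or reconnects on that connection
    tcpdump -nn -r osd41-osd148.pcap 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'

Comparing captures like this from both hosts against the debug_ms output is
one way to tell whether a message went missing on the wire or inside the OSD.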