I'd guess that to make any progress we'll need debug ms = 20 on both sides
of the connection when a message is lost.
-Sam

On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
> there was a problem on one of the clusters i manage a couple of weeks ago
> where pairs of OSDs would wait indefinitely on subops from the other OSD in
> the pair. we used a liberal dose of "ceph osd down ##" on the OSDs and
> eventually things sorted themselves out a couple of days later.
>
> it seems to have come back today, and my co-workers and i are stuck trying
> to figure out why this is happening. here are the details that i know:
> currently two OSDs, 41 and 148, keep waiting on subops from each other,
> resulting in lines such as the following in ceph.log.
>
> 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> 13:28:56.233001: osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304
> write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 5,140,148
>
> 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
> [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> 13:28:57.070471: osd_op(client.24127500.0:2960618
> rbd_data.38178d8adeb4d.00000000000010f8 [set-alloc-hint object_size 8388608
> write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 41,115
>
> from digging in the logs, it appears that some messages are being lost
> between the OSDs. this is what osd.41 sees:
> ---------------------------------------------
> 2016-04-28 13:28:56.233702 7f3b171e0700 1 -- 10.217.72.22:6802/3769 <==
> client.11172360 10.217.72.41:0/6031968 6 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> e159001) v5 ==== 236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> 0x1a78d9c0
> 2016-04-28 13:28:56.233983 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6db200 con 0x21add440
> 2016-04-28 13:28:56.234017 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6dd000 con 0x21ada000
> 2016-04-28 13:28:56.234046 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
> 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v
> 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
> 2016-04-28 13:28:56.243555 7f3b35976700 1 -- 10.217.89.22:6825/313003769
> <== osd.140 10.217.89.11:6810/4543 23 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
> 2016-04-28 13:28:56.257816 7f3b27d9b700 1 -- 10.217.89.22:6825/313003769
> <== osd.5 10.217.89.18:6806/1010441 35 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
>
> this, however, is what osd.148 sees:
> ---------------------------------------------
> [ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
> 2016-04-28 13:29:33.470156 7f195fcfc700 1 -- 10.217.72.11:6820/6487 <==
> client.11172360 10.217.72.41:0/6031968 460 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected
> e159002) v5 ==== 236+0+143360 (129493315 0 3953649960) 0x1edf2300 con
> 0x24dc0d60
>
> also, due to the "ceph osd down" commands, there is recovery that needs to
> happen for a PG shared between these OSDs, and it is never making any
> progress. it's probably due to whatever is causing the repops to fail.
>
> i did some tcpdump on both sides, limiting things to the ip addresses and
> ports being used by these two OSDs, and i can see packets flowing between
> them. i attempted to have wireshark decode the actual ceph traffic, but it
> was only able to recover bits and pieces of the ceph protocol; for the
> moment i'm blaming that on the ceph dissector for wireshark. there aren't
> any dropped or errored packets on any of the network interfaces involved.
>
> does anyone have any ideas on where to look next, or other tips for this?
> we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned
> above. putting them at 20 probably isn't going to be helpful, so does
> anyone have a suggestion for another level that might be useful? go figure
> that this would happen while i'm at the openstack summit, keeping me from
> paying attention to some interesting presentations.
>
> thanks in advance for any help.
>
> mike

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
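
A minimal sketch of applying the debug ms = 20 suggestion above, assuming the
stock ceph CLI of that era (tell/injectargs from an admin node, or the admin
socket on the host that runs each OSD); the OSD ids are the two from this
thread, everything else is generic:

    # raise messenger debugging on both ends of the stuck connection
    ceph tell osd.41  injectargs '--debug_ms 20'
    ceph tell osd.148 injectargs '--debug_ms 20'

    # equivalent, via the admin socket on each OSD's own host
    ceph daemon osd.41  config set debug_ms 20
    ceph daemon osd.148 config set debug_ms 20

    # drop it back down once a stuck op has been captured; level 20 logs grow fast
    ceph tell osd.41  injectargs '--debug_ms 1'
    ceph tell osd.148 injectargs '--debug_ms 1'

With both sides at 20, the send of the osd_repop on osd.41 and the matching
receive (or its absence) on osd.148 should both show up in the messenger log,
which is what the thread is trying to correlate.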
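
And a rough sketch of the kind of capture Mike describes, assuming the
cluster-network endpoints that appear in the osd.41 log above
(10.217.89.22:6825 for osd.41, 10.217.89.11:6812 for what looks like
osd.148's side); the interface name and output file are placeholders:

    # on either host: capture only the traffic between the two OSD endpoints
    tcpdump -i eth0 -s 0 -w osd41-osd148.pcap \
        'host 10.217.89.22 and host 10.217.89.11 and (port 6825 or port 6812)'

    # afterwards, a quick look for resets or reconnects on that connection
    tcpdump -nn -r osd41-osd148.pcap 'tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'

Comparing captures like this from both hosts against the debug_ms output is
one way to tell whether a message went missing on the wire or inside the OSD.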