i attempted to grab some logs from the two osds in question with debug_ms and debug_osd at 20. i have looked through them a little bit, but digging through logs at this verbosity is something i don't have much experience with. hopefully someone on the list can help make sense of them. the logs are at these urls.
the last one is a trimmed portion of the ceph.log from one of the monitors for the time frame the osd logs cover. to make these, i moved the existing log file, set the increased verbosity, had the osds reopen their log files, gave it a few minutes, moved the log files again, and had the osds reopen their logs a second time. this resulted in something that is hopefully just enough context to see what's going on.
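for reference, this is roughly the sequence i ran for each of the two osds (shown here for 41; the file names are made up and the injectargs / admin socket syntax is from memory, so treat it as a sketch rather than a transcript):

mv /var/log/ceph/ceph-osd.41.log /var/log/ceph/ceph-osd.41.log.old
ceph tell osd.41 injectargs '--debug_ms 20 --debug_osd 20'
ceph daemon osd.41 log reopen    # start a fresh log at the higher verbosity
# ... wait a few minutes ...
mv /var/log/ceph/ceph-osd.41.log /var/log/ceph/ceph-osd.41.log.debug
ceph daemon osd.41 log reopen    # reopen a second time so the osd keeps logging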
i did a 'ceph osd down 41' at about the 20:40:06 mark and the cluster seemed to report normal data for the next 30 seconds. after that, the slow io messages from both osds about ops from each other started appearing. i tried tracing a few ops in both logs but couldn't make sense of it. can anyone take a look and/or give me some pointers on how to understand what's going on?
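(for what it's worth, my attempt at tracing was basically to take the client op id from a slow request line and grep for it in both osd logs; the op id below is just the one from the slow request quoted further down, as an example:)

grep ':516946146' /var/log/ceph/ceph-osd.41.log
grep ':516946146' /var/log/ceph/ceph-osd.148.log
# in the messenger debug output, '-->' lines are sends and '<==' lines are
# receives, so an osd_repop with a '-->' on one osd and no matching '<=='
# on its peer looks like a message that never arrived (or was never logged)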
oh. this is 0.94.5. the basic cluster layout is two racks with 9 nodes in each rack and either 12 or 14 osds per node. ssd cache tiering is being used. the pools are just replicated ones with a size of 3. here is the data from pg dump for the pg that isn't making progress on recovery, which i'm guessing is a result of the same problem. the workload is a bunch of vms with rbd.
pg_stat:          17.73
objects:          14545
mip:              0
degr:             14545
misp:             14547
unf:              0
bytes:            62650182662
log:              10023
disklog:          10023
state:            active+undersized+degraded+remapped+backfilling
state_stamp:      2016-04-29 01:59:42.148644
v:                161768'26740604
reported:         161768:37459478
up:               [140,5,41]
up_primary:       140
acting:           [41,148]
acting_primary:   41
last_scrub:       55547'11246156
scrub_stamp:      2015-09-15 08:53:32.724322
last_deep_scrub:  53282'7470580
deep_scrub_stamp: 2015-09-01 07:19:45.054261
i also wonder if just taking 148 out of the cluster (probably just marking it out) would help. the min size is 2, but since there are other osds in the up set (140 and 5), will the cluster keep working? or will it block until the PG has finished recovering to the new osds?
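(for my own notes, i was going to sanity check the pool settings and the peering state with something like the following; i'm assuming min_size shows up in the pool line of the osd dump:)

ceph osd dump | grep '^pool 17 '          # size / min_size for the pool behind pg 17.73
ceph pg 17.73 query > /tmp/pg-17.73.json  # recovery_state / peering info for the stuck pg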
thanks in advance. hopefully someone can help soon because right now the only thing holding things together is a while loop doing a 'ceph osd down 41' every minute. :(
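(the loop is nothing fancier than this:)

while true; do ceph osd down 41; sleep 60; done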
mike
On Thu, Apr 28, 2016 at 5:49 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
I'd guess that to make any progress we'll need debug ms = 20 on both
sides of the connection when a message is lost.
-Sam
On Thu, Apr 28, 2016 at 2:38 PM, Mike Lovell <mike.lovell@xxxxxxxxxxxxx> wrote:
> there was a problem on one of the clusters i manage a couple weeks ago where
> pairs of OSDs would wait indefinitely on subops from the other OSD in the
> pair. we used a liberal dose of "ceph osd down ##" on the osds and
> eventually things just sorted themselves out a couple of days later.
>
> it seems to have come back today, and my co-workers and i are stuck trying
> to figure out why this is happening. here are the details that i know.
> currently 2 OSDs, 41 and 148, keep waiting on subops from each other
> resulting in lines such as the following in ceph.log.
>
> 2016-04-28 13:29:26.875797 osd.41 10.217.72.22:6802/3769 56283 : cluster
> [WRN] slow request 30.642736 seconds old, received at 2016-04-28
> 13:28:56.233001: osd_op(client.11172360.0:516946146
> rbd_data.36bfe359c4998.0000000000000d08 [set-alloc-hint object_size 4194304
> write_size 4194304,write 1835008~143360] 17.3df49873 RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 5,140,148
>
> 2016-04-28 13:29:28.031452 osd.148 10.217.72.11:6820/6487 25324 : cluster
> [WRN] slow request 30.960922 seconds old, received at 2016-04-28
> 13:28:57.070471: osd_op(client.24127500.0:2960618
> rbd_data.38178d8adeb4d.00000000000010f8 [set-alloc-hint object_size 8388608
> write_size 8388608,write 3194880~4096] 17.fb41a37c RETRY=1
> ack+ondisk+retry+write+redirected+known_if_redirected e159001) currently
> waiting for subops from 41,115
>
> from digging in the logs, it appears that some messages are being lost
> between the OSDs. this is what osd.41 sees:
> ---------------------------------------------
> 2016-04-28 13:28:56.233702 7f3b171e0700 1 -- 10.217.72.22:6802/3769 <==
> client.11172360 10.217.72.41:0/6031968 6 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=1 ack+ondisk+retry+write+redirected+known_if_redirected
> e159001) v5 ==== 236+0+143360 (781016428 0 3953649960) 0x1d551c00 con
> 0x1a78d9c0
> 2016-04-28 13:28:56.233983 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.18:6806/1010441 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6db200 con 0x21add440
> 2016-04-28 13:28:56.234017 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6810/4543 -- osd_repop(client.11172360.0:516946146 17.73
> 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v 159001'26722799)
> v1 -- ?+46 0x1d6dd000 con 0x21ada000
> 2016-04-28 13:28:56.234046 7f3b49020700 1 -- 10.217.89.22:6825/313003769
> --> 10.217.89.11:6812/43006487 -- osd_repop(client.11172360.0:516946146
> 17.73 3df49873/rbd_data.36bfe359c4998.0000000000000d08/head//17 v
> 159001'26722799) v1 -- ?+144137 0x14becc00 con 0xf2cd4a0
> 2016-04-28 13:28:56.243555 7f3b35976700 1 -- 10.217.89.22:6825/313003769
> <== osd.140 10.217.89.11:6810/4543 23 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (494696391 0 0) 0x28ea7b00 con 0x21ada000
> 2016-04-28 13:28:56.257816 7f3b27d9b700 1 -- 10.217.89.22:6825/313003769
> <== osd.5 10.217.89.18:6806/1010441 35 ====
> osd_repop_reply(client.11172360.0:516946146 17.73 ondisk, result = 0) v1
> ==== 83+0+0 (2393425574 0 0) 0xfe82fc0 con 0x21add440
>
>
> this, however, is what osd.148 sees:
> ---------------------------------------------
> [ulhglive-root@ceph1 ~]# grep :516946146 /var/log/ceph/ceph-osd.148.log
> 2016-04-28 13:29:33.470156 7f195fcfc700 1 -- 10.217.72.11:6820/6487 <==
> client.11172360 10.217.72.41:0/6031968 460 ====
> osd_op(client.11172360.0:516946146 rbd_data.36bfe359c4998.0000000000000d08
> [set-alloc-hint object_size 4194304 write_size 4194304,write 1835008~143360]
> 17.3df49873 RETRY=2 ack+ondisk+retry+write+redirected+known_if_redirected
> e159002) v5 ==== 236+0+143360 (129493315 0 3953649960) 0x1edf2300 con
> 0x24dc0d60
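>
> (a narrower grep for the repop itself, rather than for the client op, should
> come back empty given the output above, which is why i think the osd_repop
> from 41 never arrived at 148, or at least was never logged:)
>
> [ulhglive-root@ceph1 ~]# grep 'osd_repop(client.11172360.0:516946146' /var/log/ceph/ceph-osd.148.log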
>
> also, due to the ceph osd down commands, there is recovery that needs to
> happen for a PG shared between these OSDs, and it is never making any
> progress. it's probably due to whatever is causing the repops to fail.
>
> i did some tcpdump on both sides limiting things to the ip addresses and
> ports being used by these two OSDs and see packets flowing between the two
> osds. i attempted to have wireshark decode the actual ceph traffic, but it
> was only able to make out bits and pieces of the ceph protocol; for the
> moment i'm blaming that on the ceph dissector for wireshark. there
> aren't any dropped or error packets on any of the network interfaces
> involved.
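>
> (for reference, the captures were roughly of this form; the interface name
> and the exact ports are from memory, so treat this as approximate:)
>
> tcpdump -i bond0 -s 0 -w /tmp/osd41-osd148.pcap \
>   'host 10.217.89.22 and host 10.217.89.11 and (port 6825 or port 6812)'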
>
> does anyone have any ideas of where to look next or other tips for this?
> we've put debug_ms and debug_osd at 1/1 to get the bits of info mentioned.
> putting them at 20 probably isn't going to be helpful, so does anyone have a
> suggestion on another level to put them at that might be useful? go figure
> that this would happen while i'm at the openstack summit and it would keep
> me from paying attention to some interesting presentations.
>
> thanks in advance for any help.
>
> mike
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com