On Thursday, January 16, 2020 09:15 GMT, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:

> Hi Nick,
>
> We saw the exact same problem yesterday after a network outage -- a few of
> our down OSDs were stuck down until we restarted their processes.
>
> -- Dan
>
>
> On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> > Hi All,
> >
> > Running 14.2.5, currently experiencing some network blips isolated to a
> > single rack, which is under investigation. However, it appears that
> > following a network blip, random OSDs in unaffected racks are sometimes
> > not recovering from the incident and are left running in a zombie state.
> > The OSDs appear to be running from a process perspective, but the cluster
> > thinks they are down, and they will not rejoin the cluster until the OSD
> > process is restarted, which incidentally takes a lot longer than usual
> > (the systemctl command takes a couple of minutes to complete).
> >
> > If an OSD is left in this state, CPU and memory usage of the process
> > appears to climb, but it never rejoins, at least for the several hours
> > that I have left them. Not exactly sure what the OSD is trying to do
> > during this period. There's nothing in the logs during this hung state to
> > indicate that anything is happening, but I will try and inject more
> > verbose logging next time it occurs.
> >
> > Not sure if anybody has come across this before, or has any ideas? In the
> > past, as long as OSDs have been running they have always rejoined
> > following any network issues.
> >
> > Nick
> >
> > Sample from OSD and cluster logs below. The blip happened at 12:06; I
> > restarted the OSD at 12:26.
> >
> > OSD logs from the OSD that hung (note this OSD was not directly affected
> > by the network outage):
> >
> > 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> > 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> > 2020-01-15 12:06:34.034 7f419480a700 1 osd.43 2342992 start_waiting_for_healthy
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> > 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> >
> > Cluster logs:
> >
> > 2020-01-15 12:06:09.740607 mon.mc-ceph-mon1 (mon.0) 531400 : cluster [DBG] osd.43 reported failed by osd.57
> > 2020-01-15 12:06:09.945163 mon.mc-ceph-mon1 (mon.0) 531683 : cluster [DBG] osd.43 reported failed by osd.63
> > 2020-01-15 12:06:09.945287 mon.mc-ceph-mon1 (mon.0) 531684 : cluster [INF] osd.43 failed (root=hdd,rack=8c-hdd,host=mc-8c-osd02-hdd) (2 reporters from different host after 21.006447 >= grace 20.118871)
> > 2020-01-15 12:06:09.962867 mon.mc-ceph-mon1 (mon.0) 531775 : cluster [DBG] osd.43 reported failed by osd.49
> > 2020-01-15 12:06:10.471837 mon.mc-ceph-mon1 (mon.0) 532231 : cluster [DBG] osd.43 reported failed by osd.190
> > 2020-01-15 12:06:12.050928 mon.mc-ceph-mon1 (mon.0) 532421 : cluster [INF] osd.43 [v2:[*:*:*:5::12]:6808/1969300,v1:[*:*:*:5::12]:6809/1969300] boot
> > 2020-01-15 12:06:11.192756 osd.43 (osd.43) 1675 : cluster [WRN] Monitor daemon marked osd.43 down, but it is still running
> > 2020-01-15 12:06:11.192761 osd.43 (osd.43) 1676 : cluster [DBG] map e2342983 wrongly marked me down at e2342983
> > 2020-01-15 12:06:32.240850 mon.mc-ceph-mon1 (mon.0) 533397 : cluster [DBG] osd.49 reported failed by osd.43
> > 2020-01-15 12:06:32.241117 mon.mc-ceph-mon1 (mon.0) 533398 : cluster [DBG] osd.51 reported failed by osd.43
> > 2020-01-15 12:06:32.241247 mon.mc-ceph-mon1 (mon.0) 533399 : cluster [DBG] osd.52 reported failed by osd.43
> > 2020-01-15 12:06:32.241378 mon.mc-ceph-mon1 (mon.0) 533400 : cluster [DBG] osd.53 reported failed by osd.43
> > 2020-01-15 12:06:32.241498 mon.mc-ceph-mon1 (mon.0) 533401 : cluster [DBG] osd.54 reported failed by osd.43
> > 2020-01-15 12:06:32.241680 mon.mc-ceph-mon1 (mon.0) 533402 : cluster [DBG] osd.71 reported failed by osd.43
> > 2020-01-15 12:06:33.374171 mon.mc-ceph-mon1 (mon.0) 533762 : cluster [DBG] osd.43 reported failed by osd.15
> > 2020-01-15 12:06:33.713135 mon.mc-ceph-mon1 (mon.0) 534029 : cluster [DBG] osd.43 reported failed by osd.191
> > 2020-01-15 12:06:33.713227 mon.mc-ceph-mon1 (mon.0) 534030 : cluster [INF] osd.43 failed (root=hdd,rack=8c-hdd,host=mc-8c-osd02-hdd) (2 reporters from different host after 20.002634 >= grace 20.001226)
> > 2020-01-15 12:16:34.202137 mon.mc-ceph-mon1 (mon.0) 537464 : cluster [INF] Marking osd.43 out (has been down for 600 seconds)
> > 2020-01-15 12:26:37.655911 mon.mc-ceph-mon1 (mon.0) 538134 : cluster [INF] osd.43 [v2:[*:*:*:5::12]:6802/1286742,v1:[*:*:*:5::12]:6808/1286742] boot

Hi Dan,

Interesting that you have started seeing the same issue; have you also recently upgraded to Nautilus?

Not sure if you saw my follow-up post; the subject got messed up, so it didn't get associated with the thread. But it looks like the OSD has an outdated map and hence doesn't realise that its current state doesn't match what the rest of the cluster thinks.
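If anyone else wants to confirm the same thing on a stuck OSD, something along these lines should show whether the daemon's view of the osdmap has fallen behind the monitors. This is just a rough sketch, not tested in anger; osd.43 is simply the OSD from my logs above, and the daemon commands need to run on the host carrying that OSD:

# On a mon/admin node: what epoch is the cluster actually at?
ceph osd dump | head -1        # first line prints "epoch <N>"

# On the host running the hung OSD: what does the daemon itself think?
# If newest_map is well behind the epoch above, the OSD is sitting on
# an old map, which would match what I'm seeing.
ceph daemon osd.43 status      # shows state plus oldest_map/newest_map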
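As for injecting more verbose logging next time it occurs: since the cluster has already marked the OSD down, my plan is to do it locally over the admin socket rather than via "ceph tell". Again only a sketch, assuming the default admin socket and log locations and using osd.43 as the example:

# Run on the OSD host itself; the admin socket still responds even
# while the cluster thinks the OSD is down.
ceph daemon osd.43 config set debug_osd 20/20
ceph daemon osd.43 config set debug_ms 1

# Wait for the hang to reproduce, then watch the OSD log.
tail -f /var/log/ceph/ceph-osd.43.log

# Turn it back down afterwards (I believe 1/5 and 0/5 are the defaults).
ceph daemon osd.43 config set debug_osd 1/5
ceph daemon osd.43 config set debug_ms 0/5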
Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com