Re: Not all OSDs in rack marked as down when the rack fails

Hi Wido,

Could it be one of these?

mon osd min up ratio
mon osd min in ratio

36/120 is 0.3, so it might be one of those magic ratios at play.
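
Something like this should show whether either ratio is in play
(option names and defaults from memory, so double-check on Nautilus):

$ ceph config get mon mon_osd_min_up_ratio   # default: 0.3
$ ceph config get mon mon_osd_min_in_ratio   # default: 0.75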

Cheers,

Dan


On Thu, 29 Oct 2020, 18:05 Wido den Hollander, <wido@xxxxxxxx> wrote:

> Hi,
>
> I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
> down when the network is cut to that rack.
>
> Situation:
>
> - Nautilus cluster
> - 3 racks
> - 120 OSDs, 40 per rack
>
> We performed a test where we turned off the Top-of-Rack network for
> each rack. This worked as expected with two racks, but with the third
> something weird happened.
>
> Of the 40 OSDs which were supposed to be marked as down, only 36 were
> marked as down.
>
> In the end it took 15 minutes for all 40 OSDs to be marked as down.
>
> $ ceph config set mon mon_osd_reporter_subtree_level rack
>
> That setting is there to make sure that we only accept failure reports
> from OSDs in other racks.
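>
> To double-check what the mons are actually using, something like this
> should work (mon_osd_min_down_reporters is the companion knob; option
> names assuming Nautilus):
>
> $ ceph config get mon mon_osd_reporter_subtree_level   # expect: rack
> $ ceph config get mon mon_osd_min_down_reporters       # default: 2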
>
> What we saw in the logs for example:
>
> 2020-10-29T03:49:44.409-0400 7fbda185e700 10
> mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.51 has 54 reporters,
> 239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since
> 2020-10-29T03:47:22.374857-0400
>
> But osd.51 was still not marked as down even after 54 reporters had
> reported that it was actually down.
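>
> If I read that grace right, the mon only acts once the failure has
> been outstanding for the full grace period, i.e. not before
> max_failed_since plus ~240 s:
>
> 2020-10-29T03:47:22 + 239.86 s  ~=  2020-10-29T03:51:22
>
> At the time of that log line (03:49:44) the failure was only ~142 s
> old, well short of the grace, which would explain why the reports had
> not yet been acted on.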
>
> I checked: no ping or other traffic to osd.51 was possible. The host
> is unreachable.
>
> Another OSD was marked as down, but it took a couple of minutes as well:
>
> 2020-10-29T03:50:54.455-0400 7fbda185e700 10
> mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.37 has 48 reporters,
> 221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since
> 2020-10-29T03:47:12.761584-0400
> 2020-10-29T03:50:54.455-0400 7fbda185e700  1
> mon.CEPH2-MON1-206-U39@0(leader).osd e107102  we have enough reporters
> to mark osd.37 down
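>
> Here the arithmetic lines up: 2020-10-29T03:47:12.76 + 221.38 s is
> almost exactly 03:50:54, the moment the mon logged that it had enough
> reporters. So the large second term in the grace (the +201.379, which
> I assume is the laggy-interval adjustment added to the 20 s
> osd_heartbeat_grace) appears to be what delayed the mark-down.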
>
> In the end osd.51 was marked as down, but only after the mon's
> no-beacon timeout kicked in:
>
> 2020-10-29T03:53:44.631-0400 7fbda185e700  0 log_channel(cluster) log
> [INF] : osd.51 marked down after no beacon for 903.943390 seconds
> 2020-10-29T03:53:44.631-0400 7fbda185e700 -1
> mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since
> 2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago.  marking down
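>
> Those 903.9 seconds are just past 900 s, which I believe is the
> default mon_osd_report_timeout, so this looks like the beacon timeout
> firing rather than the reporter path ever succeeding:
>
> $ ceph config get mon mon_osd_report_timeout   # default: 900 seconds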
>
> I haven't seen this happen before in any cluster. It's also strange
> that this only happens in this rack; the other two racks work fine.
>
> ID    CLASS  WEIGHT      TYPE NAME
>    -1         1545.35999  root default
> -206          515.12000      rack 206
>    -7           27.94499          host CEPH2-206-U16
> ...
> -207          515.12000      rack 207
>   -17           27.94499          host CEPH2-207-U16
> ...
> -208          515.12000      rack 208
>   -31           27.94499          host CEPH2-208-U16
> ...
>
> That's what the CRUSH map looks like: straightforward, with 3x
> replication over the 3 racks.
>
> This issue only occurs in rack *207*.
>
> Has anybody seen this before or know where to start?
>
> Wido
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


