Re: Not all OSDs in rack marked as down when the rack fails

Wido den Hollander <wido@xxxxxxxx> · Fri, 30 Oct 2020 11:28:06 +0100

On 29/10/2020 18:58, Dan van der Ster wrote:
Hi Wido,

Could it be one of these?

mon osd min up ratio
mon osd min in ratio

36/120 is 0.3 so it might be one of those magic ratios at play.

I thought of those settings and checked them. The weird thing is that 
all 3 racks are identical and it works as expected in the other two 
racks: there, all 40 OSDs are properly marked as down.

These settings should also produce messages in the MON's log, but there 
are none. Searching for 'ratio' in the logfile doesn't show me anything.
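
As a rough check of the ratio idea, the arithmetic can be sketched in Python. This is only a sketch under assumptions: the 0.3 threshold is what I believe the default for mon_osd_min_up_ratio to be (verify with `ceph config get mon mon_osd_min_up_ratio`), and the function assumes the MON compares the fraction of OSDs still up, not the fraction down, against it. The helper names are made up for illustration.

```python
# Numbers from the thread: 120 OSDs in total, 36 marked down in the failed rack.
TOTAL_OSDS = 120
MARKED_DOWN = 36

# Assumed default; verify on the cluster before relying on it.
MON_OSD_MIN_UP_RATIO = 0.3


def up_ratio(total, down):
    """Fraction of OSDs that would still be up with `down` OSDs marked down."""
    return (total - down) / total


def blocked_by_min_up_ratio(total, down, min_up_ratio=MON_OSD_MIN_UP_RATIO):
    """Assumption: marking down is refused once the up fraction falls below the threshold."""
    return up_ratio(total, down) < min_up_ratio


down_fraction = MARKED_DOWN / TOTAL_OSDS
print(f"down fraction with 36 down: {down_fraction:.2f}")
print(f"up fraction with 36 down:   {up_ratio(TOTAL_OSDS, 36):.2f}")
print(f"up fraction with 40 down:   {up_ratio(TOTAL_OSDS, 40):.2f}")
print("blocked at 40 down:", blocked_by_min_up_ratio(TOTAL_OSDS, 40))
```

If those assumptions hold, the 0.3 is the down fraction, while the up fraction stays well above the threshold even with the whole rack down, which would fit with nothing about ratios showing up in the log.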

It's weird that osd.51 was only marked as down after 15 minutes because 
it didn't send a beacon to the MON. Other OSDs kept sending reports that 
it was down, but the MONs simply didn't act on it.

Wido

Cheers,

Dan

On Thu, 29 Oct 2020, 18:05 Wido den Hollander, <wido@xxxxxxxx> wrote:

    Hi,

    I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
    down when the network is cut to that rack.

    Situation:

    - Nautilus cluster
    - 3 racks
    - 120 OSDs, 40 per rack

    We performed a test where we turned off the Top-of-Rack network switch
    for each rack. This worked as expected with two racks, but with the
    third something weird happened.

    Of the 40 OSDs that should have been marked as down, only 36 were.

    In the end it took 15 minutes for all 40 OSDs to be marked as down.

    $ ceph config set mon mon_osd_reporter_subtree_level rack

    That setting is set to make sure that we only accept reports from other
    racks.
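
To illustrate what that setting is meant to achieve, here is a minimal Python sketch of counting failure reporters per rack instead of per OSD. It is not the MON's code: the OSD-to-rack mapping is invented for the example, and the threshold of 2 assumes the default of mon_osd_min_down_reporters.

```python
from collections import defaultdict

# Hypothetical OSD -> rack mapping, for illustration only.
OSD_RACK = {1: "206", 2: "206", 50: "207", 51: "207", 81: "208", 82: "208"}


def reporters_by_rack(reporters, osd_rack):
    """Group the reporting OSDs by the rack they live in."""
    grouped = defaultdict(list)
    for osd in reporters:
        grouped[osd_rack[osd]].append(osd)
    return dict(grouped)


def enough_reporters(reporters, osd_rack, min_down_reporters=2):
    """Count distinct racks, not individual OSDs, against the threshold.

    min_down_reporters=2 is an assumed default, not taken from this cluster.
    """
    return len(reporters_by_rack(reporters, osd_rack)) >= min_down_reporters


print(enough_reporters([1, 2], OSD_RACK))   # two OSDs, but a single rack
print(enough_reporters([1, 81], OSD_RACK))  # two OSDs in two different racks
```

With reporters counted this way, any number of OSDs from a single rack still counts as one reporter, so a failure is only accepted once at least two racks agree.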

    What we saw in the logs for example:

    2020-10-29T03:49:44.409-0400 7fbda185e700 10
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.51 has 54 reporters,
    239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since
    2020-10-29T03:47:22.374857-0400

    But osd.51 was still not marked as down even though 54 reporters had
    reported that it was down.

    I checked, no ping or other traffic possible to osd.51. Host is
    unreachable.

    Another osd was marked as down, but it took a couple of minutes as well:

    2020-10-29T03:50:54.455-0400 7fbda185e700 10
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.37 has 48 reporters,
    221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since
    2020-10-29T03:47:12.761584-0400
    2020-10-29T03:50:54.455-0400 7fbda185e700  1
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  we have enough reporters
    to mark osd.37 down
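
The two log excerpts above can be compared with a small Python sketch. It rests on an assumption about how the MON reads these numbers: an OSD is only marked down once the time since max_failed_since reaches the logged grace value, in addition to having enough reporters. The timestamps and grace values are copied from the log lines; the function names are made up for illustration.

```python
from datetime import datetime


def parse_ts(text):
    """Parse a timestamp as it appears in the MON log."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def failed_for(logged_at, max_failed_since):
    """Seconds between max_failed_since and the moment the line was logged."""
    return (parse_ts(logged_at) - parse_ts(max_failed_since)).total_seconds()


def past_grace(logged_at, max_failed_since, grace):
    """Assumption: the MON acts only once failed_for >= grace."""
    return failed_for(logged_at, max_failed_since) >= grace


# osd.51: 54 reporters, grace 239.856038
osd51 = ("2020-10-29T03:49:44.409-0400", "2020-10-29T03:47:22.374857-0400", 239.856038)
# osd.37: 48 reporters, grace 221.378970
osd37 = ("2020-10-29T03:50:54.455-0400", "2020-10-29T03:47:12.761584-0400", 221.378970)

for name, (logged_at, since, grace) in (("osd.51", osd51), ("osd.37", osd37)):
    elapsed = failed_for(logged_at, since)
    print(f"{name}: failed for {elapsed:.3f}s, grace {grace:.3f}s, "
          f"past grace: {past_grace(logged_at, since, grace)}")
```

Under that assumption, the osd.51 line was logged before its grace period had elapsed, while osd.37 had just passed its own. This only covers the two quoted lines; it does not explain why osd.51 was still not marked down later on.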

    In the end osd.51 was marked as down, but only after the MON decided to
    do so:

    2020-10-29T03:53:44.631-0400 7fbda185e700  0 log_channel(cluster) log
    [INF] : osd.51 marked down after no beacon for 903.943390 seconds
    2020-10-29T03:53:44.631-0400 7fbda185e700 -1
    mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since
    2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago.  marking down
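
The roughly 15 minutes can be reproduced from the two timestamps in that message. The Python sketch below assumes the beacon timeout is governed by mon_osd_report_timeout with a default of 900 seconds; that value is an assumption and should be checked on the cluster.

```python
from datetime import datetime

# Assumed default of mon_osd_report_timeout, in seconds.
REPORT_TIMEOUT = 900.0


def parse_ts(text):
    """Parse a timestamp as it appears in the MON log."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


last_beacon = parse_ts("2020-10-29T03:38:40.689062-0400")
marked_down = parse_ts("2020-10-29T03:53:44.631-0400")

# The log line's own timestamp has millisecond precision, so this differs
# slightly from the 903.943390 seconds the MON reports.
elapsed = (marked_down - last_beacon).total_seconds()
print(f"seconds without beacon: {elapsed:.3f}")
print("over the assumed timeout:", elapsed > REPORT_TIMEOUT)
```

So osd.51 appears to have been marked down by the beacon timeout rather than by the failure reports from its peers.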

    I haven't seen this happen before in any cluster. It's also strange
    that this only happens in this rack; the other two racks work fine.

    ID    CLASS  WEIGHT      TYPE NAME
      -1         1545.35999  root default
    -206          515.12000      rack 206
      -7           27.94499          host CEPH2-206-U16
    ...
    -207          515.12000      rack 207
     -17           27.94499          host CEPH2-207-U16
    ...
    -208          515.12000      rack 208
     -31           27.94499          host CEPH2-208-U16
    ...

    That's what the CRUSH map looks like: straightforward, with 3x
    replication over 3 racks.

    This issue only occurs in rack *207*.

    Has anybody seen this before, or does anyone know where to start?

    Wido
    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx
