ceph crush reweight #osd 0 strange redistribution

Hi List

This is a rephrase of an earlier question that puzzles me. I took out a
disk on Nautilus 14.2.8 with 'ceph crush reweight osd.111 0'. I expected
that the PGs would mainly go to OSDs on other nodes, but to my surprise
most of them ended up on OSDs on the same node. Here is an overview of the
number of PGs that got sent to another OSD. The node that held osd.111
also holds osd.108 through osd.116.

NR PGs    OSD_ID
      1   61
      1   83
      1   86
      3   108
      4   109
      5   110
      2   112
      5   113
      7   114
      5   115
      2   116

As you can see, only 3 PGs were sent to OSDs on other nodes; the rest
went to OSDs on the same node :-O. There are 81 OSDs in the same room as
osd.111. Since the host weight also decreases when an OSD is crush
reweighted to 0, I would certainly have expected the PGs to mostly go to
other nodes. Can someone explain this behaviour?
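
For reference, one way to compare the PG-to-OSD mappings before and after
such a reweight is to dump them from the osdmap with osdmaptool. Just a
sketch; the file names are examples:

  # dump pg -> up/acting osd sets from the current osdmap
  ceph osd getmap -o osdmap.before
  osdmaptool osdmap.before --test-map-pgs-dump > pgs.before

  # the reweight (full form of the command used above)
  ceph osd crush reweight osd.111 0

  ceph osd getmap -o osdmap.after
  osdmaptool osdmap.after --test-map-pgs-dump > pgs.after

  # every changed line is a PG whose osd set was re-chosen
  diff pgs.before pgs.after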

All pools use crush_rule 1 with min_size = 2 and size = 3. The rule is:

  "rule_id": 1,
  "rule_name": "hdd",
  "ruleset": 1,
  "type": 1,
  "min_size": 2,
  "max_size": 3,
  "steps": [
      {
          "op": "take",
          "item": -31,
          "item_name": "DC3"
      },
      {
          "op": "choose_firstn",
          "num": 0,
          "type": "room"
      },
      {
          "op": "chooseleaf_firstn",
          "num": 1,
          "type": "host"
      }
  ]
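
The rule can also be exercised offline with crushtool, and the reweight
can be simulated there without touching the cluster. Again just a sketch:
the file names are examples, crushtool maps synthetic inputs rather than
the real PGs (the re-selection behaviour is the same), and it assumes your
crushtool has --reweight-item:

  # grab the crushmap and test rule 1 with 3 replicas
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings > map.orig

  # set osd.111 to weight 0 in a copy of the map and test again
  crushtool -i crushmap.bin --reweight-item osd.111 0 -o crushmap.rw
  crushtool -i crushmap.rw --test --rule 1 --num-rep 3 --show-mappings > map.rw

  diff map.orig map.rw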

I cannot attach the complete osd tree, but the structure is:

ID  CLASS WEIGHT     TYPE NAME                         STATUS REWEIGHT PRI-AFF
-31       1275.35522 root DC3
-32        600.69037     room az-a
-39        133.14589         rack rack_W26
-27         66.57289             host st3g3psm2
108   hdd    7.39699                 osd.108               up  1.00000 1.00000
109   hdd    7.39699                 osd.109               up  1.00000 1.00000
110   hdd    7.39699                 osd.110               up  1.00000 1.00000
111   hdd    7.39699                 osd.111               up  1.00000 1.00000
112   hdd    7.39699                 osd.112               up  1.00000 1.00000
113   hdd    7.39699                 osd.113               up  1.00000 1.00000
114   hdd    7.39699                 osd.114               up  1.00000 1.00000
115   hdd    7.39699                 osd.115               up  1.00000 1.00000
116   hdd    7.39699                 osd.116               up  1.00000 1.00000

The tree has 3 rooms, 13 racks, 19 hosts, 172 OSDs, 3968 PGs and 10
pools. As you can see there are racks in the tree, but they are not taken
into consideration in crush rule 1. The cluster is not yet fully in use,
hence the PG-to-OSD ratio is still low. We expect more pools to be added
in the near future.
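
(For the record, that is roughly 3968 PGs x 3 replicas / 172 OSDs = ~69
PG copies per OSD, somewhat below the ~100 per OSD that is usually aimed
for.)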

Marcel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


