On Wed, Aug 7, 2019 at 12:08 AM Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
On 8/7/19 1:40 PM, Robert LeBlanc wrote:
> Maybe it's the lateness of the day, but I'm not sure how to do that.
> Do you have an example where all the OSDs are of class ssd?
I can't parse what you mean. You should always paste your `ceph osd tree`
first.
Our 'ceph osd tree' is like this:
```
ID  CLASS WEIGHT    TYPE NAME                 STATUS REWEIGHT PRI-AFF
 -1       892.21326 root default
 -3        69.16382     host sun-pcs01-osd01
  0   ssd   3.49309         osd.0                 up  1.00000 1.00000
  1   ssd   3.42329         osd.1                 up  0.87482 1.00000
  2   ssd   3.49309         osd.2                 up  0.88989 1.00000
  3   ssd   3.42329         osd.3                 up  0.94989 1.00000
  4   ssd   3.49309         osd.4                 up  0.93993 1.00000
  5   ssd   3.42329         osd.5                 up  1.00000 1.00000
  6   ssd   3.49309         osd.6                 up  0.89490 1.00000
  7   ssd   3.42329         osd.7                 up  1.00000 1.00000
  8   ssd   3.49309         osd.8                 up  0.89482 1.00000
  9   ssd   3.42329         osd.9                 up  1.00000 1.00000
100   ssd   3.49309         osd.100               up  1.00000 1.00000
101   ssd   3.42329         osd.101               up  1.00000 1.00000
102   ssd   3.49309         osd.102               up  1.00000 1.00000
103   ssd   3.42329         osd.103               up  0.81482 1.00000
104   ssd   3.49309         osd.104               up  0.87973 1.00000
105   ssd   3.42329         osd.105               up  0.86485 1.00000
106   ssd   3.49309         osd.106               up  0.79965 1.00000
107   ssd   3.42329         osd.107               up  1.00000 1.00000
108   ssd   3.49309         osd.108               up  1.00000 1.00000
109   ssd   3.42329         osd.109               up  1.00000 1.00000
 -5        62.24744     host sun-pcs01-osd02
 10   ssd   3.49309         osd.10                up  1.00000 1.00000
 11   ssd   3.42329         osd.11                up  0.72473 1.00000
 12   ssd   3.49309         osd.12                up  1.00000 1.00000
 13   ssd   3.42329         osd.13                up  0.78979 1.00000
 14   ssd   3.49309         osd.14                up  0.98961 1.00000
 15   ssd   3.42329         osd.15                up  1.00000 1.00000
 16   ssd   3.49309         osd.16                up  0.96495 1.00000
 17   ssd   3.42329         osd.17                up  0.94994 1.00000
 18   ssd   3.49309         osd.18                up  1.00000 1.00000
 19   ssd   3.42329         osd.19                up  0.80481 1.00000
110   ssd   3.49309         osd.110               up  0.97998 1.00000
111   ssd   3.42329         osd.111               up  1.00000 1.00000
112   ssd   3.49309         osd.112               up  1.00000 1.00000
113   ssd   3.42329         osd.113               up  0.72974 1.00000
116   ssd   3.49309         osd.116               up  0.91992 1.00000
117   ssd   3.42329         osd.117               up  0.96997 1.00000
118   ssd   3.49309         osd.118               up  0.93959 1.00000
119   ssd   3.42329         osd.119               up  0.94481 1.00000
```
... plus 11 more hosts just like this
How do you single out one OSD from each host for metadata only, and keep data off that OSD, when all the device classes are the same? It seems that you would need one OSD to be a different class to do that (a rough sketch of that approach follows the quoted exchange). In a previous email the exchange was:

>> Is it possible to add a new device class like 'metadata'?
> Yes, but you don't need this. Just use your existing class with another crush ruleset.
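If I'm reading the device-class mechanics right, singling out one OSD per host would look roughly like the sketch below. This is only an illustration, not anything from our cluster: the class name 'metadata', the rule name 'fs_meta_rule', the pool name 'cephfs_metadata', and the OSD id are all made up.

```
# Reassign one OSD per host to a hypothetical 'metadata' device class.
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class metadata osd.0
# ...repeat for one OSD on every other host...

# Replicated rule that only selects OSDs of class 'metadata',
# one replica per host.
ceph osd crush rule create-replicated fs_meta_rule default host metadata

# Point the CephFS metadata pool at that rule.
ceph osd pool set cephfs_metadata crush_rule fs_meta_rule
```

But that is exactly the "different class" approach I was asking whether we can avoid.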
> Yes, we can set quotas to limit space usage (or number objects), but
> you can not reserve some space that other pools can't use. The problem
> is if we set a quota for the CephFS data pool to the equivalent of 95%
> there are at least two scenarios that make that quota useless.
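(For reference, the quota in question would be set with something like the following; the pool name matches the listing further down, but the byte value is purely illustrative.)

```
# Cap the CephFS data pool at some fraction of usable capacity.
# This only limits fs_data; it does not reserve space for fs_meta.
ceph osd pool set-quota fs_data max_bytes 100000000000000
```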
Of course. In 95% of CephFS deployments the metadata pool lives on flash drives
with enough space for this.
```
pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool stripe_width 0 application cephfs
pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool stripe_width 0 application cephfs
```
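Note how the split works: fs_meta uses crush_rule 0, which is the NVMe rule dumped below, while fs_data stays on rule 4. A pool is tied to a rule with nothing more than (rule name from the dump below; whether it was set this way or at pool creation is an assumption):

```
# Pin the metadata pool to the NVMe-only rule and verify.
ceph osd pool set fs_meta crush_rule replicated_racks_nvme
ceph osd pool get fs_meta crush_rule
```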
```
# ceph osd crush rule dump replicated_racks_nvme
{
    "rule_id": 0,
    "rule_name": "replicated_racks_nvme",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -44,
            "item_name": "default~nvme"      <------------
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "rack"
        },
        {
            "op": "emit"
        }
    ]
}
```
Yes, our HDD cluster is much like this, but it is not on Luminous, so we created a separate root with SSD OSDs for the metadata and set up a CRUSH rule that maps the metadata pool to those SSDs. I understand that the CRUSH rule should have a `step take default class ssd`, which I don't see in your rule, unless the `~` in the item_name means device class.
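If I understand the class mechanics correctly, that is exactly what the `~` means: `default~nvme` is the class-specific shadow root that Luminous builds for each device class under `default`, and a rule like yours would come from the class-aware form of create-replicated. A minimal sketch, with the rule and class names taken from your dump:

```
# Replicated rule: take from root 'default' restricted to class 'nvme',
# placing one replica per rack. The JSON dump then shows the shadow root
# as item_name "default~nvme", i.e. <root>~<device-class>.
ceph osd crush rule create-replicated replicated_racks_nvme default rack nvme
```

In decompiled CRUSH text that same step reads `step take default class nvme`, which matches the `step take default class ssd` form I was expecting.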
Thanks
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com