I'm going to assume that ALL of your pools are replicated with size 3, since you didn't provide that info, and that all but the *hdd pools are on SSDs -- `ceph osd dump | grep pool` will show the sizes and CRUSH rules. Let me know if that isn't the case.

With that assumption, I make your overall PG ratio to be ~57 (2057 PGs x 3 replicas / 107 OSDs), which is way too low. Run `ceph osd df` and look at the next-to-last column; that's the number of PG shards/replicas on each OSD. In my example below there are 145. Check both your HDD and SSD OSDs.

[rook@rook-ceph-tools-5ff8d58445-gkl5w ~]$ ceph osd df | tail
433  hdd  7.27739  1.00000  7.3 TiB  344 MiB  250 MiB  0 B  94 MiB  7.3 TiB  0.00  0.04  145  up
462  hdd  7.27739  1.00000  7.3 TiB  344 MiB  250 MiB  0 B  95 MiB  7.3 TiB  0.00  0.04  145  up

Current upstream guidance is to target ~100 PG shards/replicas on each OSD, to protect against running out of RAM. Some people, including me, find larger numbers appropriate depending on the media -- ymmv, free advice and worth every penny. I personally shoot for ~150-200 per HDD OSD and 200-300 per SSD OSD. I would suggest the below.

You don't mention RAM, though. Assuming that your OSDs are all BlueStore, I would suggest at least 64GB on the HDD nodes and 96GB on the SSD nodes *for the OSD processes*. If you have VMs or mons/mgrs or other significant compute on the OSD nodes, they'll naturally need extra. When you increase pg_num for a given pool, you'll increase the RAM that the OSDs hosting that pool use, especially during topology changes or startup. So if your RAM is marginal now, increasing pg_num could lead to OOM-killing. Trust me, that's something best avoided. If you still have FileStore OSDs, I strongly suggest redeploying them one at a time as BlueStore.

> Additionally, some OSDs fail during the scrubbing process. In such
> instances, promptly halting the scrubbing resolves the issue.

Have you looked for drive errors? e.g. with `dmesg`.

> I intend to enlarge the PG size for the "one-ssd" configuration.

Not a bad idea, see below.

> Please provide the PG number, and suggest the optimal approach to increase the PG
> size without causing any slowdown or service disruptions to the VMs.

With releases beginning with Nautilus, this became a MUCH easier task than it used to be.

> > nodeep-scrub flag(s) set
> > 656 pgs not deep-scrubbed in time

If you have osd_scrub_begin_hour / osd_scrub_end_hour (or begin/end day) set, that could contribute to this. IMHO one should be able to scrub around the clock, unless your workload varies considerably throughout the day. Assuming that you aren't setting those, you might want to double osd_deep_scrub_interval. The default is 7 days; I might suggest 14 for your HDD OSDs, assuming those are the ones that are not being deep-scrubbed in time.

> > osd: 107 osds: 105 up (since 3h), 105 in (since 3d)

You have two OSDs down, btw.

> data:
>   pools:   13 pools, 2057 pgs
>   objects: 9.40M objects, 35 TiB
>   usage:   106 TiB used, 154 TiB / 259 TiB avail
>   pgs:     2057 active+clean
>
> root@ceph1:~# ceph df
> --- RAW STORAGE ---
> CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd    151 TiB   78 TiB   72 TiB    73 TiB      48.04
> ssd    110 TiB   78 TiB   32 TiB    32 TiB      29.42
> TOTAL  261 TiB  156 TiB  104 TiB   105 TiB      40.19

One often aims for each pool's share of the total PG count to be roughly proportional to its share of the cluster's data.

https://old.ceph.com/pgcalc/
^ this isn't working for me at the moment

Right now your RGW buckets.data pool is using half the space of the one-ssd pool, but it has twice the PGs.
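Since I keep referring to per-class PG ratios, here's a quick way to compute them yourself instead of eyeballing `ceph osd df` -- just a sketch, and it assumes the stock column layout (CLASS is the second field, PGS is next to last):

# average PG shards/replicas per OSD, split by device class
ceph osd df | awk '
    $2 == "hdd" { h += $(NF-1); nh++ }
    $2 == "ssd" { s += $(NF-1); ns++ }
    END {
        if (nh) printf "hdd avg PG replicas/OSD: %.0f\n", h/nh
        if (ns) printf "ssd avg PG replicas/OSD: %.0f\n", s/ns
    }'

On your numbers that should land near the ~74 (HDD) and ~48 (SSD) I work out below; if it doesn't, my guess about which pools live on which media is wrong.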
So you'd want to invoke:

ceph osd pool set one-ssd pg_num 512

If you want to gauge the impact, you might set it to, say, 260 first and see how it goes, but most likely you could set it to 512 and the cluster will gradually step it up a few PGs at a time. During that time you'll see some data moving around. It is ideal for each pool to end up with pg_num set to a power of 2, but you can have in-between numbers along the way for a short time.

If my assumptions and calculations above are correct, your HDD OSDs have a PG ratio of ~74. If you were to bump each of the HDD pools from 512 to 1024 PGs, your ratio would end up roughly at 150, which IMHO is desirable -- assuming you have the RAM. But address your SSD pools first, they're suffering more.

If my assumptions and calculations above are correct, your SSD OSDs have a PG ratio of ~48, which is waaay too low. I suggest setting one-ssd to 512, which should result in the ratio on those OSDs rising to ~64.

If that holds true given my interpretation of what you supplied, I might then set the index and log pools to 64 -- you want those to be >= the number of OSDs they're on, which is not currently the case. Then choose whichever of buckets.data and one-ssd is used more heavily and set it to 1024. Let the cluster settle, check the PG ratios again, and I think you'll be around ~100 on the SSD OSDs, at which point I might then set the other SSD pool to 1024. Do these one at a time, in however many increments you're comfortable with.

> --- POOLS ---
> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> cephfs_data                  1   64  3.8 KiB        0   11 KiB      0     23 TiB
> cephfs_metadata              2    8  228 MiB       79  685 MiB      0     23 TiB
> .rgw.root                    3   32  6.0 KiB        8  1.5 MiB      0     23 TiB
> default.rgw.control          4   32      0 B        8      0 B      0     23 TiB
> default.rgw.meta             5   32   12 KiB       48  7.5 MiB      0     23 TiB
> default.rgw.log              6   32  4.8 KiB      207  6.0 MiB      0     23 TiB
> default.rgw.buckets.index    7   32  410 MiB       15  1.2 GiB      0     23 TiB
> default.rgw.buckets.data     8  512  4.6 TiB    1.29M   14 TiB  16.59     23 TiB
> default.rgw.buckets.non-ec   9   32  1.0 MiB      676  130 MiB      0     23 TiB
> one-hdd                     10  512  9.2 TiB    2.45M   28 TiB  28.69     23 TiB
> device_health_metrics       11    1  9.5 MiB      113   28 MiB      0     23 TiB
> one-ssd                     12  256   11 TiB    2.88M   32 TiB  31.37     23 TiB
> cloudstack.hdd              15  512   10 TiB    2.72M   31 TiB  30.94     23 TiB
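In case it's useful, here's roughly the shell sequence for that first bump -- a sketch only, with the pool name taken from your `ceph df` output above; the throttle and scrub-interval lines are assumptions about what you might want, not things you must set:

# sanity check the pool first
ceph osd pool get one-ssd pg_num
ceph osd pool get one-ssd size

# optional: limit the impact of the resulting data movement
ceph config set osd osd_max_backfills 1              # fewer concurrent backfills per OSD
ceph config get mgr target_max_misplaced_ratio       # .05 by default; governs how fast pg_num steps up

# the actual bump -- since Nautilus the cluster raises pg_num a few PGs at a time
ceph osd pool set one-ssd pg_num 512

# watch it settle
ceph -s
ceph osd df                                          # the PGS column should creep upward

# separately, doubling the deep-scrub interval for just the HDD OSDs would look
# something like this (1209600 seconds = 14 days; the class:hdd mask assumes you
# want it per device class rather than global)
ceph config set osd/class:hdd osd_deep_scrub_interval 1209600

Once `ceph -s` shows everything back to active+clean, repeat the pg_num step for the next pool on the list.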