Hi Eugen,

On hosts hvs001 and hvs003 I recreated the OSDs on LVM volumes of the exact
same size. Now everything looks great. The OSDs on hvs002 are still larger,
but I see no complaints from Ceph, so I'll leave it as it is for now.

Thanks for the help (especially for stopping me from trying complex tweaks
and probably making things worse 😉).

Dominique.
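For the archive, the recreate went roughly along these lines (only a sketch,
not the literal commands I ran; the volume group, LV name, size and OSD id
are placeholders for whatever fits your setup):

lvcreate -L 730M -n osd-lv vg-ceph                   # same LV size on every host
ceph orch osd rm 0 --zap                             # remove the old OSD (--zap needs a reasonably recent cephadm)
ceph orch osd rm status                              # wait until the removal has finished
ceph orch daemon add osd hvs001:/dev/vg-ceph/osd-lv  # create the new OSD on the LV

The only thing that really matters is that the LVs end up identical in size
on every host, so the crush weights come out equal.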
> -----Original message-----
> From: Eugen Block <eblock@xxxxxx>
> Sent: Thursday, 7 April 2022 13:58
> To: Dominique Ramaekers <dominique.ramaekers@xxxxxxxxxx>
> CC: ceph-users@xxxxxxx
> Subject: Re: Re: Ceph status HEALT_WARN - pgs problems
>
> The PGs are not activating because of the uneven OSD weights:
>
> > root@hvs001:/# ceph osd tree
> > ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> > -1         1.70193  root default
> > -3         0.00137      host hvs001
> >  0    hdd  0.00069          osd.0        up   1.00000  1.00000
> >  1    hdd  0.00069          osd.1        up   1.00000  1.00000
> > -5         1.69919      host hvs002
> >  2    hdd  0.84959          osd.2        up   1.00000  1.00000
> >  3    hdd  0.84959          osd.3        up   1.00000  1.00000
> > -7         0.00137      host hvs003
> >  4    hdd  0.00069          osd.4        up   1.00000  1.00000
> >  5    hdd  0.00069          osd.5        up   1.00000  1.00000
>
> This means ceph assigns (almost) all PGs to OSDs 2 and 3. You should try
> to create OSDs of the same size, especially in such a small environment.
> You could play around with crush weights, but I don't think that's the
> best approach. If possible, recreate the OSDs so they are kind of the
> same size.
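(Side note for anyone reading this later: the crush weight tweaking Eugen
mentions would look something like the two lines below, forcing the big OSDs
down to the weight the small ones report in 'ceph osd tree'. I didn't go
that route; recreating the OSDs at equal size is the cleaner fix.)

ceph osd crush reweight osd.2 0.00069
ceph osd crush reweight osd.3 0.00069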
>
> Quoting Dominique Ramaekers <dominique.ramaekers@xxxxxxxxxx>:
>
> > Hi Eugen,
> >
> > You say I don't have to worry about changing pg_num manually. Makes
> > sense. Does this also apply to pg_num_max? Will the pg_autoscaler
> > also change this parameter if necessary?
> >
> > Below is the output you requested.
> >
> > root@hvs001:/# ceph -s
> >   cluster:
> >     id:     dd4b0610-b4d2-11ec-bb58-d1b32ae31585
> >     health: HEALTH_WARN
> >             Reduced data availability: 64 pgs inactive
> >             Degraded data redundancy: 68 pgs undersized
> >
> >   services:
> >     mon: 3 daemons, quorum hvs001,hvs002,hvs003 (age 43h)
> >     mgr: hvs001.baejuo(active, since 43h), standbys: hvs002.etijdk
> >     osd: 6 osds: 6 up (since 25h), 6 in (since 4h); 4 remapped pgs
> >
> >   data:
> >     pools:   2 pools, 68 pgs
> >     objects: 2 objects, 705 KiB
> >     usage:   134 MiB used, 1.7 TiB / 1.7 TiB avail
> >     pgs:     94.118% pgs not active
> >              4/6 objects misplaced (66.667%)
> >              64 undersized+peered
> >              4 active+undersized+remapped
> >
> >   progress:
> >     Global Recovery Event (0s)
> >       [............................]
> >
> > root@hvs001:/# ceph osd tree
> > ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> > -1         1.70193  root default
> > -3         0.00137      host hvs001
> >  0    hdd  0.00069          osd.0        up   1.00000  1.00000
> >  1    hdd  0.00069          osd.1        up   1.00000  1.00000
> > -5         1.69919      host hvs002
> >  2    hdd  0.84959          osd.2        up   1.00000  1.00000
> >  3    hdd  0.84959          osd.3        up   1.00000  1.00000
> > -7         0.00137      host hvs003
> >  4    hdd  0.00069          osd.4        up   1.00000  1.00000
> >  5    hdd  0.00069          osd.5        up   1.00000  1.00000
> >
> > root@hvs001:/# ceph osd pool ls detail
> > pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> > rjenkins pg_num 4 pgp_num 2 pg_num_target 1 pgp_num_target 1
> > autoscale_mode on last_change 76 lfor 0/0/64 flags hashpspool
> > stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
> > pool 2 'libvirt-pool' replicated size 3 min_size 2 crush_rule 0
> > object_hash rjenkins pg_num 64 pgp_num 1 pgp_num_target 64
> > autoscale_mode on last_change 73 lfor 0/0/73 flags hashpspool
> > stripe_width 0 pg_num_max 128 application rbd
> >
> > root@hvs001:/# ceph osd crush rule dump
> > [
> >     {
> >         "rule_id": 0,
> >         "rule_name": "replicated_rule",
> >         "type": 1,
> >         "steps": [
> >             {
> >                 "op": "take",
> >                 "item": -1,
> >                 "item_name": "default"
> >             },
> >             {
> >                 "op": "chooseleaf_firstn",
> >                 "num": 0,
> >                 "type": "host"
> >             },
> >             {
> >                 "op": "emit"
> >             }
> >         ]
> >     }
> > ]
> >
> > root@hvs001:~# tail /var/log/syslog
> > Apr  7 11:21:12 hvs001 bash[2670]: debug 2022-04-07T11:21:12.719+0000 7f51a7008700  0 log_channel(cluster) log [DBG] : pgmap v79196: 68 pgs: 64 undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used, 1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
> > Apr  7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.360Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="Post https://10.3.1.23:8443//api/prometheus_receiver: x509: cannot validate certificate for 10.3.1.23 because it doesn't contain any IP SANs" context_err="context deadline exceeded"
> > Apr  7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.360Z caller=notify.go:372 component=dispatcher msg="Error on notify" err="Post https://hvs002.cometal.be:8443/api/prometheus_receiver: x509: certificate is valid for ceph-dashboard, not hvs002.cometal.be" context_err="context deadline exceeded"
> > Apr  7 11:21:13 hvs001 bash[2673]: level=error ts=2022-04-07T11:21:13.361Z caller=dispatch.go:301 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="Post https://10.3.1.23:8443//api/prometheus_receiver: x509: cannot validate certificate for 10.3.1.23 because it doesn't contain any IP SANs; Post https://hvs002.cometal.be:8443/api/prometheus_receiver: x509: certificate is valid for ceph-dashboard, not hvs002.cometal.be"
> > Apr  7 11:21:14 hvs001 bash[2668]: cluster 2022-04-07T11:21:12.722206+0000 mgr.hvs001.baejuo (mgr.64107) 79190 : cluster [DBG] pgmap v79196: 68 pgs: 64 undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used, 1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
> > Apr  7 11:21:14 hvs001 bash[2670]: debug 2022-04-07T11:21:14.719+0000 7f51a7008700  0 log_channel(cluster) log [DBG] : pgmap v79197: 68 pgs: 64 undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used, 1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
> > Apr  7 11:21:15 hvs001 bash[2668]: cluster 2022-04-07T11:21:14.723411+0000 mgr.hvs001.baejuo (mgr.64107) 79191 : cluster [DBG] pgmap v79197: 68 pgs: 64 undersized+peered, 4 active+undersized+remapped; 705 KiB data, 134 MiB used, 1.7 TiB / 1.7 TiB avail; 4/6 objects misplaced (66.667%)
> > Apr  7 11:21:16 hvs001 bash[2668]: debug 2022-04-07T11:21:16.199+0000 7fc12252e700  1 mon.hvs001@0(leader).osd e87 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 348127232 full_alloc: 348127232 kv_alloc: 322961408
> > Apr  7 11:21:16 hvs001 bash[2670]: ::ffff:10.3.1.23 - - [07/Apr/2022:11:21:16] "GET /metrics HTTP/1.1" 200 166748 "" "Prometheus/2.18.1"
> > Apr  7 11:21:16 hvs001 bash[2670]: debug 2022-04-07T11:21:16.267+0000 7f514f9b2700  0 [prometheus INFO cherrypy.access.139987709758544] ::ffff:10.3.1.23 - - [07/Apr/2022:11:21:16] "GET /metrics HTTP/1.1" 200 166748 "" "Prometheus/2.18.1"
> >
> > ________________________________________
> > From: Eugen Block <eblock@xxxxxx>
> > Sent: Thursday, 7 April 2022 12:49
> > To: ceph-users@xxxxxxx
> > Subject: Re: Ceph status HEALT_WARN - pgs problems
> >
> > Hi,
> >
> > please add some more output, e.g.
> >
> > ceph -s
> > ceph osd tree
> > ceph osd pool ls detail
> > ceph osd crush rule dump (of the used rulesets)
> >
> > You have the pg_autoscaler enabled, you don't need to deal with pg_num
> > manually.
> >
> >
> > Quoting Dominique Ramaekers <dominique.ramaekers@xxxxxxxxxx>:
> >
> >> Hi,
> >>
> >> My cluster is up and running. I saw a note in ceph status that 1 pg
> >> was undersized. I read about the number of pgs and the recommended
> >> value (OSDs*100/poolsize => 6*100/3 = 200). The pg_num should be
> >> raised carefully, so I raised it to 2 and ceph status was fine again.
> >> So I left it like it was.
> >>
> >> Then I created a new pool: libvirt-pool.
> >>
> >> Now ceph status is again in warning regarding pgs. I raised
> >> pg_num_max of the libvirt-pool to 265 and pg_num to 128.
> >>
> >> Ceph status stays in warning.
> >> root@hvs001:/# ceph status
> >> ...
> >>     health: HEALTH_WARN
> >>             Reduced data availability: 64 pgs inactive
> >>             Degraded data redundancy: 68 pgs undersized
> >> ...
> >>     pgs:    94.118% pgs not active
> >>             4/6 objects misplaced (66.667%)  -this has been there since
> >>             the creation of the cluster-
> >>             64 undersized+peered
> >>             4 active+undersized+remapped
> >>
> >> I also get a progress item "Global Recovery Event (0s)", which only
> >> goes away with 'ceph progress clear'.
> >>
> >> My autoscale-status is the following:
> >> root@hvs001:/# ceph osd pool autoscale-status
> >> POOL          SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> >> .mgr          576.5k               3.0   1743G         0.0000                                 1.0        1              on         False
> >> libvirt-pool       0               3.0   1743G         0.0000                                 1.0       64              on         False
> >>
> >> (It's a 3-node cluster with 2 OSDs per node.)
> >>
> >> The documentation doesn't help me much here. What should I do?
> >>
> >> Greetings,
> >>
> >> Dominique.
> >>
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx