Hello. I have a few issues with my ceph cluster:

- RGWs have disappeared from management (the console does not register any RGWs) despite showing 4 services deployed and the processes running;
- No object buckets are accessible or manageable;
- The console shows some of my pools as "updating"; it has been like this for a few days.

What was done (rough command sketches for these steps are appended at the end of this mail):

- Expanded a 30 OSD / 3 node cluster to 60 OSDs across 6 nodes;
- Changed the failure domain of our CRUSH rules from OSD to host;
- Increased pg_num and pgp_num of the main data pools to reflect the additional OSDs and keep a target of roughly 100 PGs per OSD;
- Attempted to redeploy / re-create the RGW service and remove its SSL config;
- A single OSD with very high ECC errors was found and preemptively removed from the cluster.

The cluster took a few days rebalancing itself, and my impression was that it would be done by now. It is not healthy, and as noted above, the RGWs are no longer manageable. I'm not sure where to start troubleshooting this, as I've never encountered such a scenario before.

Cluster specs:

- 6 OSD nodes (10 OSDs each);
- 5 Monitors;
- 2 Managers;
- 5 MDS;
- 4 RGWs;
- Quincy 17.2.5.
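Appendix: rough sketches of the commands used, for reference.

The CRUSH failure-domain change was done roughly like this (the rule name and <pool-name> below are placeholders, not our real names, and this assumes replicated pools):

    # create a new replicated rule that spreads replicas across hosts
    # ("default" is the CRUSH root, "host" the failure domain)
    ceph osd crush rule create-replicated replicated_host default host

    # point each affected pool at the new rule; data then remaps and backfills
    ceph osd pool set <pool-name> crush_rule replicated_host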
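The PG bump was sized with the usual rule of thumb; the arithmetic and commands were roughly as follows (<main-data-pool> is a placeholder, and I'm assuming replicated size 3 here):

    # target of ~100 PGs per OSD:
    #   60 OSDs * 100 PGs / 3 replicas = 2000 -> nearest power of two = 2048
    ceph osd pool set <main-data-pool> pg_num 2048
    ceph osd pool set <main-data-pool> pgp_num 2048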
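The RGW redeploy attempt was along these lines (<svc-id> and the port are placeholders, and this is not necessarily the exact sequence that was run):

    # what cephadm thinks is deployed and running
    ceph orch ls rgw
    ceph orch ps --daemon-type rgw

    # remove and re-create the service, this time without any SSL settings
    ceph orch rm rgw.<svc-id>
    ceph orch apply rgw <svc-id> --port=80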
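The failing OSD was taken out roughly like this (<osd-id> is a placeholder):

    # mark the OSD out and let cephadm drain and remove it
    ceph osd out <osd-id>
    ceph orch osd rm <osd-id>

    # watch removal and backfill progress
    ceph orch osd rm status
    ceph -s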