Dear list,
I'm currently maintaining several Ceph (prod) installations. One of them consists of 3 MON hosts and 6 OSD hosts hosting 40 OSDs in total. And there are 5 separate Proxmox-Hosts - they only host the VMs and use the storage provided by Ceph, but they are not part of Ceph.
The worst case happened: due to an outage, all these hosts crashed at almost the same time.
Last week, I began restarting (only the Ceph hosts; the Proxmox servers are still down). Ceph was very unhappy with the situation as a whole: one OSD host (and its 6 OSDs) is completely gone, there are various hardware issues (33 OSDs left, networking, PSU; I'm working on it), and 73 out of 129 PGs were inconsistent.
Meanwhile, the overall status of the cluster is back to "HEALTH_OK".
But nearly every day, one or two PGs get flagged inconsistent again - never on the same OSDs. And there is no traffic on the storage, as the virtualization hosts are not running. I see no further clues in the logs: everything looks fine, a scrub starts and leaves one or more PGs inconsistent. Repairing them succeeds, but the next night another PG may turn up damaged.
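For reference, the inspect-and-repair cycle I'm currently running looks roughly like this (the PG ID is just an example):

```shell
# List the PGs currently flagged inconsistent
ceph health detail
ceph pg ls inconsistent

# Inspect one damaged PG: shows which objects' replicas disagree,
# with per-OSD error details (read errors, checksum mismatches, ...)
rados list-inconsistent-obj 2.1a --format=json-pretty

# Trigger a repair of that PG
ceph pg repair 2.1a
```

The per-OSD error types from list-inconsistent-obj might be the most useful hint, but so far I haven't spotted a pattern.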
Do you have any hints on how to investigate this further? I would love to understand more before starting the Proxmox cluster again. I'm using Ceph 18.2.4 (Proxmox packages).
Thanks a lot,
Marianne
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx