Hi,
I am trying to gather experience with a Ceph STAGE cluster; it consists of virtual machines - which is not ideal, I know. The VMs run Debian 12 and podman-4.3.1. There is practically no load on this Ceph - there is just one client using the storage, and it barely does anything. This is what happened:
- "During data consistency checks (scrub), at least one PG has been flagged as being damaged or inconsistent."
- so I listed them (["2.3","2.58"])
- and tried to repair them ("ceph pg repair 2.3", "ceph pg repair 2.58"; the rough command sequence I used is sketched after this list)
- both repairs went well (resulting in "pgs: 129 active+clean"), but the cluster kept its "HEALTH_WARN" state ("Too many repaired reads on 1 OSDs")
- so I googled for this message, and the only thing I found was to restart the OSD to get rid of the message and - more importantly - the cluster WARN state ("ceph orch daemon restart osd.3")
- after the restart, my cluster was still in WARN state, now complaining that 2 PGs had been flagged as damaged or inconsistent - but different PGs on different OSDs
- I "ceph pg repair"ed them, too, and the cluster's state was WARN afterwards, again ("Too many repaired reads on 1 OSDs")
- when I restarted the OSD ("ceph orch daemon restart osd.2"), the crash occurred; Ceph marked this OSD "down" and "out" and suspected a hardware issue, although the OSD HDDs are in fact QEMU "hard disks"
- I can't judge whether it's a serious bug or just due to my non-optimal STAGE setup, so I'll attach the gzipped log of osd.2
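In case the exact sequence matters, this is roughly what I ran (reconstructed from my shell history; "<pool>" stands for my data pool, and the PG IDs are the ones from above):

    # find the inconsistent PGs reported by the scrub warning
    ceph health detail
    rados list-inconsistent-pg <pool>
    # inspect one of them, just to see what is inconsistent
    rados list-inconsistent-obj 2.3 --format=json-pretty
    # repair the flagged PGs
    ceph pg repair 2.3
    ceph pg repair 2.58
    # restart the OSD that still shows "Too many repaired reads"
    ceph orch daemon restart osd.3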
I need help to understand what happened and how to prevent it in the future. What is this "Too many repaired reads" and how should I deal with it?
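From what I could google, the warning seems to be the OSD_TOO_MANY_REPAIRS health check, which apparently fires once an OSD has counted more repaired reads than mon_osd_warn_num_repaired (default 10, if I read the docs correctly). I have not dared to change anything yet, but the commands I found look roughly like this - please correct me if they are wrong or not appropriate here:

    # show the warning threshold for repaired reads
    ceph config get osd mon_osd_warn_num_repaired
    # supposedly resets the repaired-reads counter without restarting the OSD
    ceph tell osd.3 clear_shards_repaired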
Thanks a lot for reading,
Marianne
Attachment:
ceph-osd.2.log.gz
Description: GNU Zip compressed data