Satoru;

Ok. What your cluster is telling you, then, is that it doesn't know which replica is the "most current" or "correct" replica. You will need to determine that, and let Ceph know which one to use as the "good" replica. Unfortunately, I can't help you with this. In fact, if this is critical data, I'd seriously consider engaging a contractor to help you recover the data and return your cluster to a fully operational status.

I have found it helpful to set noout and norebalance when I intend to reboot or offline any of my OSDs. It's also critical to allow the cluster to return to HEALTH_OK in between reboots.

Thank you,

Dominic L. Hilsbos, MBA
Vice President – Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com

From: Satoru Takeuchi [mailto:satoru.takeuchi@xxxxxxxxx]
Sent: Friday, August 20, 2021 2:48 PM
To: Dominic Hilsbos
Cc: ceph-users
Subject: Re: Re: The reason of recovery_unfound pg

Hi Dominic,

On Sat, Aug 21, 2021 at 1:05, <DHilsbos@xxxxxxxxxxxxxx> wrote:
> Satoru;
>
> You said " after restarting all nodes one by one." After each reboot, did you allow the cluster the time necessary to come back to a "HEALTH_OK" status?

No, we rebooted with the following policy.

1. Reboot one machine.
2. Wait until the reboot completes at the Kubernetes level (not at the Ceph cluster level).
3. If there are other nodes to be rebooted, go to step 1.

I should have explained this logic to you as well. I now realize that the above logic is wrong and that I should wait for the cluster to come back to HEALTH_OK. Unfortunately, I don't understand the meaning of the pg states well, and there seem to be several states which mean "pg might be lost".

https://docs.ceph.com/en/latest/rados/operations/pg-states/

Could you tell me whether a pg can enter the `recovery_unfound` state in this case?

Thanks,
Satoru
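
For reference, the reboot practice described in this thread (set noout and norebalance, reboot one node, then wait for HEALTH_OK before touching the next) could be sketched roughly as below. This is a minimal sketch under assumptions, not a tested procedure: the function names `run_cli`, `wait_for_health_ok`, and `rolling_reboot` are made up for illustration, `reboot_node` is a hypothetical callback you would have to supply (it must block until the node and its OSDs are back), and the poll interval and timeout are arbitrary. The flags are cleared before the health check because set flags are themselves reported as a health warning, so the cluster would not reach HEALTH_OK while they are in place.

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence

# A runner takes a command (list of arguments) and returns its stdout as text.
Runner = Callable[[Sequence[str]], str]

# Flags mentioned in the thread for planned OSD downtime.
FLAGS = ("noout", "norebalance")


def run_cli(cmd: Sequence[str]) -> str:
    """Default runner: execute the command and return its stdout."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def wait_for_health_ok(
    run: Runner = run_cli,
    poll_seconds: float = 10.0,      # arbitrary choice
    timeout_seconds: float = 3600.0,  # arbitrary choice
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll `ceph health` until it reports HEALTH_OK, or raise TimeoutError."""
    waited = 0.0
    while True:
        status = run(["ceph", "health"]).strip()
        if status.startswith("HEALTH_OK"):
            return
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still reports: {status}")
        sleep(poll_seconds)
        waited += poll_seconds


def rolling_reboot(
    nodes: Iterable[str],
    reboot_node: Callable[[str], None],  # hypothetical: blocks until node is back
    run: Runner = run_cli,
    **wait_kwargs,
) -> None:
    """Reboot nodes one at a time, waiting for HEALTH_OK after each one."""
    for node in nodes:
        for flag in FLAGS:
            run(["ceph", "osd", "set", flag])
        # If this raises, the flags stay set and the operator decides what to do.
        reboot_node(node)
        for flag in FLAGS:
            run(["ceph", "osd", "unset", flag])
        # Do not move on to the next node until the cluster has fully recovered.
        wait_for_health_ok(run, **wait_kwargs)
```

The key difference from the policy quoted above is the last step of the loop: the wait is on the Ceph cluster state, not on the node coming back at the Kubernetes level. Passing the runner and the reboot callback as arguments keeps the sketch independent of how nodes are actually rebooted in a given environment.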