I am wondering if anyone has seen the mark_unfound_lost delete command seemingly not do what it is supposed to, or if perhaps I have unreasonable expectations about its function.

We have an EC pool serving as an RGW data pool, and we have had a data loss scenario. I've attempted to manually recover shards from all peers listed in might_have_unfound, with some success, but after extensive searching I believe the time has come to let go of the data we are still missing, in hopes of getting the cluster back to healthy and restoring service functionality.

When I run "ceph pg 21.258e mark_unfound_lost delete", the command runs for a few minutes, and then the primary OSD drops out of the cluster although its daemon is still running. The logs suggest this is because it is doing some intensive iterative operations and is unresponsive to other OSDs. Given that we have tens of thousands of objects being marked lost, it would make sense that this takes some time. In the meantime, however, the OSD is marked out, another OSD takes its place, and the number of unfound objects for the PG climbs back to the original amount over the next few hours. So far, the primary OSD has failed to come back in every time I've tried this operation.

My initial reaction was to restart the OSD when it dropped from the cluster (and its PG went into the DOWN state) in an attempt to keep the RGW functioning, but I realized that could have been counterproductive once I saw in the logs that the primary was iterating over objects. Yet even when I leave the OSD to complete the iterative process, it doesn't seem to rejoin the cluster without intervention in the form of a daemon restart.

I'm wondering if anyone has experience deleting unfound objects at this scale, and whether this is an asynchronous operation that eventually completes or we are encountering unexpected behavior that warrants a bug report.
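In case it helps anyone compare notes, here is a minimal sketch of how the might_have_unfound peers can be pulled out of the JSON printed by "ceph pg 21.258e query". It assumes the output has a top-level recovery_state list whose entries may carry a might_have_unfound list of osd/status pairs. The helper name peers_to_check and the sample data are hypothetical and not taken from our cluster.

```python
import json


def peers_to_check(pg_query_json):
    """Collect (osd, status) pairs from every might_have_unfound list.

    Assumes the layout described above: a top-level "recovery_state"
    list whose entries may contain a "might_have_unfound" list.
    """
    data = json.loads(pg_query_json)
    peers = []
    for state in data.get("recovery_state", []):
        for peer in state.get("might_have_unfound", []):
            peers.append((str(peer.get("osd")), peer.get("status", "")))
    return peers


# Hypothetical, heavily trimmed sample; real query output has many more fields.
SAMPLE = """
{"recovery_state": [
  {"name": "Started/Primary/Active",
   "might_have_unfound": [
     {"osd": "12(3)", "status": "already probed"},
     {"osd": "47(1)", "status": "osd is down"}
   ]}
]}
"""

if __name__ == "__main__":
    for osd, status in peers_to_check(SAMPLE):
        print(osd, status)
```

On a live cluster the same function could be fed the captured output of the query command instead of SAMPLE. Peers whose status is not "already probed" are the ones that may still hold shards worth looking for.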
I am also wondering whether ceph-objectstore-tool might be employed to work on all shards of the PG at once, so that they could all be started back up together, minus the unfound objects. I haven't seen much useful documentation on the "fix-lost" operation, so I have hesitated to try it without a full understanding of what it does.

Thank you to anyone who might be able to provide some information.

--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.andrus@xxxxxxxxxxxxx | www.dreamhost.com