Hi!

There seems to be an issue where an OSD is not queried for missing EC object parts that were remapped, even though that OSD is up. This happened to us in two different scenarios. In both, data is stored in EC pools (8+3).

Scenario 0

To remove a broken disk (e.g. osd.22), it is weighted to 0 with ceph osd out 22. Objects are remapped normally. During object movement, osd.22 is restarted (or crashes and then starts again). Now the bug shows up: objects become degraded and stay degraded, because osd.22 is not queried although it is up and running. ceph pg query shows:

    "might_have_unfound": [
        {
            "osd": "22(3)",
            "status": "not queried"
        }
    ],

A workaround is to temporarily mark the broken-disk OSD in again. The OSD is then queried and the missing EC object parts are discovered. Then mark the OSD out again. No objects are degraded any more and the disk will be emptied.

Scenario 1

Add new disks to the cluster. Data is remapped to be transferred from the old disks (e.g. osd.19) to new disks (e.g. osd.42). When an OSD on one of the old disks is restarted (or restarts because of a crash), objects become degraded. The missing EC object part data is on osd.19, but again it is not queried. ceph pg query shows:

    "might_have_unfound": [
        {
            "osd": "19(6)",
            "status": "not queried"
        }
    ],

Only remapped data seems to be undiscovered: if osd.19 is taken down, much more data is degraded. Note that osd.19 is missing from the acting set in the current state of this PG:

    "up": [38, 36, 28, 17, 13, 39, 48, 10, 29, 5, 47],
    "acting": [36, 15, 28, 17, 13, 32, 2147483647, 10, 29, 5, 20],
    "backfill_targets": [
        "36(1)", "38(0)", "39(5)", "47(10)", "48(6)"
    ],
    "acting_recovery_backfill": [
        "5(9)", "10(7)", "13(4)", "15(1)", "17(3)", "20(10)", "28(2)", "29(8)",
        "32(5)", "36(0)", "36(1)", "38(0)", "39(5)", "47(10)", "48(6)"
    ],

For this scenario, I have not found a workaround yet. The cluster remains degraded until it has recovered by restoring the data.
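
To make the symptom easier to spot across many PGs, here is a minimal Python sketch that scans the parsed JSON of a PG query for "might_have_unfound" entries with status "not queried" and reports which of those OSDs are absent from the acting set. The function names are hypothetical, and the exact nesting of "might_have_unfound" in the query output is an assumption, so the sketch searches for the key recursively instead of relying on a fixed path. The value 2147483647 is treated as the "no OSD" placeholder seen in the acting set above.

```python
NO_OSD = 2147483647  # placeholder in "acting" for a shard with no OSD assigned


def find_might_have_unfound(node):
    """Recursively collect every entry found under a "might_have_unfound" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "might_have_unfound" and isinstance(value, list):
                found.extend(value)
            else:
                found.extend(find_might_have_unfound(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_might_have_unfound(item))
    return found


def unqueried_osds(pg_query):
    """Return the sorted OSD ids whose status is "not queried"."""
    ids = set()
    for entry in find_might_have_unfound(pg_query):
        if isinstance(entry, dict) and entry.get("status") == "not queried":
            # An entry like "19(6)" means OSD id 19, shard 6.
            ids.add(int(entry["osd"].split("(")[0]))
    return sorted(ids)


def unqueried_outside_acting(pg_query):
    """Return unqueried OSD ids that are not part of the acting set."""
    acting = {osd for osd in pg_query.get("acting", []) if osd != NO_OSD}
    return [osd for osd in unqueried_osds(pg_query) if osd not in acting]
```

To use it, save the query output of an affected PG to a file, load it with json.load, and pass the result to unqueried_outside_acting. For the Scenario 1 PG shown above this should report osd.19.
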

So, overall I suspect there is a bug which prevents remapped PG data from being discovered. The PG already knows which OSD is the correct candidate, but does not query it.

I can try fixing this myself, but I'd need some hints from the developers about the relevant parts of the code. The OSD is stored correctly in pg->might_have_unfound, and I think it should be queried in PG::discover_all_missing, but I'm lost there.

I'd appreciate any help tracking this down.

-- Jonas