Oliver Dzombic <info@...> writes:

> Hi Blade,
>
> you can try to set the min_size to 1, to get it back online, and if/when
> the errors vanish (maybe after another repair command) you can set the
> min_size back to 2.
>
> you can try to simply out/down/?remove? the osd where it is on.

Hi Oliver,

Thanks very much for your suggestions!

Setting min_size down to 1 on all pools did get my cluster back online, and
I was then able to repair pg 1.32. However, there were still "139 scrub
errors" which did not repair: I would issue a "ceph pg repair" command, but
the owning OSD did not seem to receive it. Then I restarted one of the OSDs
that held an inconsistent pg, asked Ceph to repair that pg again, and this
time it worked!

So I wrote a quick for-loop to issue a repair command for each inconsistent
pg:

  for pg in $(ceph health detail | grep ^pg | awk '{print $2}'); do
      ceph pg repair $pg
  done

While running that and watching the OSD logs, I saw that after the first few
repairs the repair commands were no longer reaching the OSDs that own the
pgs. Again, restarting an OSD before sending it a repair command fixed that.
(Is it possible there is a queue of repair requests, and a failed repair
blocks the queue?) After many OSD restarts I finally repaired all the pgs.

So I am very happy that my cluster is fixed now, but also quite confused
about why the OSDs needed to be restarted repeatedly before the repair
commands would run. Now I'm in the process of increasing the replication
level back up.

Thanks again,
Blade.
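P.S. In case it helps anyone who finds this thread later, the min_size change
was roughly of the following form. This is only a sketch of what I ran (pool
names come from "rados lspools"), and of course it means accepting the risk
of serving I/O from a single surviving copy while it is in effect:

  # drop min_size to 1 on every pool so pgs with only one surviving
  # copy can go active again
  for pool in $(rados lspools); do
      ceph osd pool set $pool min_size 1
  done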
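The restart-then-repair step for a stuck pg looked roughly like this. Here
pg 1.32 and osd.7 are only examples, and the systemctl line assumes a
systemd-based install; on older init systems the OSD would be restarted via
the ceph init script instead:

  # find which OSDs hold the inconsistent pg; the first OSD listed in
  # the acting set is the primary
  ceph pg map 1.32

  # restart the primary OSD, then ask for the repair again
  systemctl restart ceph-osd@7
  ceph pg repair 1.32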
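And once "ceph health detail" came back clean, putting things back was just
the reverse (again only a sketch, assuming pools that normally run with
min_size 2):

  # raise min_size back to 2 on every pool
  for pool in $(rados lspools); do
      ceph osd pool set $pool min_size 2
  done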