Oliver Dzombic <info@...> writes:

> Hi Blade,
>
> you can try to set the min_size to 1, to get it back online, and if/when
> the errors vanish (maybe after another repair command) you can set the
> min_size back to 2.
>
> you can try to simply out/down/?remove? the osd where it is on.

Hi Oliver,

Thanks very much for your suggestions!

Setting min_size down to 1 on all pools did get my cluster back online, and
I was then able to repair pg 1.32. However, there were still "139 scrub
errors" which did not repair: I would issue a "ceph pg repair" command, but
the owning OSD did not seem to receive it. Then I restarted one of the OSDs
that held an inconsistent pg, asked Ceph to repair that pg again, and this
time it worked!

So I wrote a quick for-loop to issue a repair command for each inconsistent
pg:

  for pg in $(ceph health detail | grep ^pg | awk '{print $2}'); do
      ceph pg repair $pg
  done

While running that and watching the OSD logs, I saw that after the first few
repairs the repair commands were no longer reaching the OSDs that own the
pgs. Again, restarting an OSD before sending it a repair command fixed that.
(Is it possible there is a queue of repair requests, and a failed repair
blocks the queue?) After many OSD restarts I finally repaired all the pgs.

So I am very happy that my cluster is fixed now, but also quite confused
about why the OSDs needed to be restarted repeatedly before the repair
commands would run. Now I'm in the process of increasing the replication
level back up.

Thanks again,
Blade.
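P.S. In case it helps anyone who finds this thread later, the min_size change
was roughly of the following form. This is only a sketch of what I ran (pool
names come from "rados lspools"), and of course it means accepting the risk
of serving I/O from a single surviving copy while it is in effect:

  # drop min_size to 1 on every pool so pgs with only one surviving
  # copy can go active again
  for pool in $(rados lspools); do
      ceph osd pool set $pool min_size 1
  done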
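The restart-then-repair step for a stuck pg looked roughly like this. Here
pg 1.32 and osd.7 are only examples, and the systemctl line assumes a
systemd-based install; on older init systems the OSD would be restarted via
the ceph init script instead:

  # find which OSDs hold the inconsistent pg; the first OSD listed in
  # the acting set is the primary
  ceph pg map 1.32

  # restart the primary OSD, then ask for the repair again
  systemctl restart ceph-osd@7
  ceph pg repair 1.32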
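And once "ceph health detail" came back clean, putting things back was just
the reverse (again only a sketch, assuming pools that normally run with
min_size 2):

  # raise min_size back to 2 on every pool
  for pool in $(rados lspools); do
      ceph osd pool set $pool min_size 2
  done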