On Sat, 4 Dec 2010, Christian Brunner wrote:
> >> On Thu, 2 Dec 2010, Christian Brunner wrote:
> >>> We have simulated the simultaneous crash of multiple osds in our
> >>> environment. After starting all the cosd daemons again, we have the
> >>> following situation:
> >>>
> >>> 2010-12-02 16:18:33.944436 pg v724432: 3712 pgs: 1 active, 3605
> >>> active+clean, 1 crashed+peering, 46 down+peering, 56
> >>> crashed+down+peering, 3 active+clean+inconsistent; 177 GB data, 365 GB
> >>> used, 83437 GB / 83834 GB avail; 1/93704 degraded (0.001%)
> >>>
> >>> When I issue an "rbd rm" command for one of our rbd volumes, it seems
> >>> to hit the "crashed+down+peering" pg. After that the command is
> >>> stuck.
> >>
> >> The pg isn't active, so any IO will hang until peering completes. What
> >> version of the code are you running? If it's something from unstable
> >> from the last couple of weeks it's probably related to problems there;
> >> please upgrade and restart the osds. If it's the latest and greatest
> >> 'rc', we should look at the logs to see what's going on!
> >
> > We are running 0.23 - I will upgrade to the latest 'rc' tomorrow.
>
> Upgrading to the latest rc version worked well. Everything is working
> again and all pgs except one are "active+clean". That one pg is marked
> as "active+clean+inconsistent".
>
> What can I do about an inconsistent pg? In general, a short
> description of the possible pg states would be helpful.

If a scrub detects an issue it marks the PG inconsistent. The idea is
that you can then do

 # ceph pg repair 1.123

to re-scrub and repair. Make sure you have the latest 'rc' before you do
that, though, as I just fixed an issue there yesterday.

And keep in mind the repair code is not well tested. You may want to
re-scrub first (ceph pg scrub 1.123) and watch the log (ceph -w) to see
what the inconsistency actually is.

sage
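A minimal sketch of the scrub-then-repair sequence described above,
assuming the inconsistent pg really is 1.123 (substitute the pg id your
cluster actually reports):

 # ceph pg scrub 1.123     (re-scrub the pg)
 # ceph -w                 (watch the log to see what the inconsistency is)
 # ceph pg repair 1.123    (then re-scrub and repair, once you know what
                            you are asking it to fix)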