On Sat, 4 Dec 2010, Christian Brunner wrote:
> >> On Thu, 2 Dec 2010, Christian Brunner wrote:
> >>> We have simulated the simultaneous crash of multiple osds in our
> >>> environment. After starting all the cosd daemons again, we have the
> >>> following situation:
> >>>
> >>> 2010-12-02 16:18:33.944436 pg v724432: 3712 pgs: 1 active, 3605
> >>> active+clean, 1 crashed+peering, 46 down+peering, 56
> >>> crashed+down+peering, 3 active+clean+inconsistent; 177 GB data, 365 GB
> >>> used, 83437 GB / 83834 GB avail; 1/93704 degraded (0.001%)
> >>>
> >>> When I issue an "rbd rm" command for one of our rbd volumes, it seems
> >>> to hit the "crashed+down+peering" pg. After that the command is
> >>> stuck.
> >>
> >> The pg isn't active, so any IO will hang until peering completes. What
> >> version of the code are you running? If it's something from unstable
> >> from the last couple of weeks it's probably related to problems there;
> >> please upgrade and restart the osds. If it's the latest and greatest
> >> 'rc', we should look at the logs to see what's going on!
> >
> > We are running 0.23 - I will upgrade to the latest 'rc' tomorrow.
>
> Upgrading to the latest rc version worked well. Everything is working
> again and all pgs except one are "active+clean". That one pg is marked
> as "active+clean+inconsistent".
>
> What can I do about an inconsistent pg? In general, a short
> description of the possible pg states would be helpful.

If a scrub detects an issue it marks the PG inconsistent. The idea is
that you can then do

 # ceph pg repair 1.123

to re-scrub and repair. Make sure you have the latest 'rc' before you do
that, though, as I just fixed an issue there yesterday.

And keep in mind the repair code is not well tested. You may want to
re-scrub first (ceph pg scrub 1.123) and watch the log (ceph -w) to see
what the inconsistency actually is.

sage
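A minimal sketch of the scrub-then-repair sequence described above,
assuming the inconsistent pg really is 1.123 (substitute the pg id your
cluster actually reports):

 # ceph pg scrub 1.123     (re-scrub the pg)
 # ceph -w                 (watch the log to see what the inconsistency is)
 # ceph pg repair 1.123    (then re-scrub and repair, once you know what
                            you are asking it to fix)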