On Wed, Feb 22, 2012 at 12:25 PM, Jens Rehpöhler <jens.rehpoehler@xxxxxxxx> wrote:
> Hi Gregory,
>
> On 22.02.2012 18:12, Gregory Farnum wrote:
>> On Feb 22, 2012, at 1:53 AM, "Jens Rehpöhler" <jens.rehpoehler@xxxxxxxx> wrote:
>>
>>> Some additions: meanwhile we are at this state:
>>>
>>> 2012-02-22 10:38:49.587403 pg v1044553: 2046 pgs: 2036 active+clean,
>>> 10 active+clean+inconsistent; 2110 GB data, 4061 GB used, 25732 GB /
>>> 29794 GB avail
>>>
>>> The active+recovering+remapped+backfill state disappeared after a
>>> restart of a crashed OSD.
>>>
>>> The OSD crashed after issuing the command "ceph pg repair 106.3".
>>>
>>> The repeating message is also there:
>> Hmm. These messages indicate there are requests that came in that
>> never got answered -- or else that the tracking code isn't quite right
>> (it's new functionality). What version are you running?
> We use:
>
> root@fcmsnode0:~# ceph -v
> ceph version 0.42-62-gd6de0bb
> (commit:d6de0bb83bcac238b3a6a376915e06fb7129b2c8)
>
> Kernel is 3.2.1
>
> I accidentally updated one of our OSDs to 0.42, so we updated the whole
> cluster.
>
> The OSD repeatedly crashed while issuing the "repair" command. The
> inconsistent PGs are all on the same (newly added) node.

Oh, that's interesting. Are all the other nodes in the cluster up and in?

In the next version or two we will have a lot more capability to look into
what's happening with stuck PGs like this, but for the moment we need a log.
If all the other nodes in the system are up, can you restart this new OSD
with "debug osd = 20" and "debug ms = 1" added to its config?
-Greg
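
As a point of reference, a minimal sketch of the change Greg is asking for
might look like the following. The osd id "osd.12" is a placeholder (the
actual daemon name is not given in the thread), and the section placement
assumes a ceph.conf managed by the stock init script of that era; only the
two debug settings themselves come from Greg's request:

    # ceph.conf -- only the two debug lines come from Greg's request;
    # the section name / osd id is a placeholder
    [osd.12]
            debug osd = 20
            debug ms = 1

After adding those lines, restarting just that daemon (the exact invocation
may vary by distribution and deployment) would be something like:

    /etc/init.d/ceph restart osd.12

The verbose output then lands in that OSD's log file, typically under
/var/log/ceph/, which is the log Greg needs in order to see what happened
to the unanswered requests.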