Re: Cluster sync doesn't finish

I've filed this as bug #1738.  Unfortunately, it will take a bit of
effort to fix.  In the short term, you could switch to a crushmap
where each node at the bottom level of the hierarchy contains more
than one device (i.e., remove the node level and stop at the rack
level).
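
For illustration only, here is a rough sketch of what such a flattened
map could look like in decompiled form; the bucket names, ids and
weights below are made up, not taken from your map:

  # hypothetical excerpt: racks hold the osd devices directly,
  # with no per-node buckets in between
  rack rack0 {
          id -2
          alg straw
          hash 0          # rjenkins1
          item osd.0 weight 1.000
          item osd.1 weight 1.000
  }
  rack rack1 {
          id -3
          alg straw
          hash 0          # rjenkins1
          item osd.2 weight 1.000
          item osd.3 weight 1.000
  }

(Depending on how your rules are written, any rule step that currently
chooses by the node type would also need to choose by rack instead.)
The usual cycle to try it is to pull the current map, decompile, edit,
recompile and inject it back; the filenames here are arbitrary:

  ceph osd getcrushmap -o crushmap
  crushtool -d crushmap -o crushmap.txt
  # edit crushmap.txt, then:
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new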

Thanks for the help!
-Sam

On Fri, Nov 18, 2011 at 12:17 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
> Hi Sam,
>
> here the crushmap
>
> http://85.214.49.87/ceph/crushmap.txt
> http://85.214.49.87/ceph/crushmap
>
> -martin
>
> Samuel Just schrieb:
>>
>> It looks like a crushmap-related problem.  Could you send us the crushmap?
>>
>> ceph osd getcrushmap
>>
>> Thanks
>> -Sam
>>
>> On Fri, Nov 18, 2011 at 10:13 AM, Gregory Farnum
>> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>>>
>>> On Fri, Nov 18, 2011 at 10:05 AM, Tommi Virtanen
>>> <tommi.virtanen@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On Thu, Nov 17, 2011 at 12:48, Martin Mailand <martin@xxxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>> I am doing a cluster failure test, where I shut down one OSD and
>>>>> wait for the cluster to sync. But the sync never finishes; at
>>>>> around 4-5% it stops. I stopped osd2.
>>>>
>>>> ...
>>>>>
>>>>> 2011-11-17 16:42:45.520740    pg v1337: 600 pgs: 547 active+clean, 53
>>>>> active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB
>>>>> avail;
>>>>> 4025/82404 degraded (4.884%)
>>>>
>>>> ...
>>>>>
>>>>> The osd log, ceph.conf, pg dump, and osd dump can be found here:
>>>>>
>>>>> http://85.214.49.87/ceph/
>>>>
>>>> This looks a bit worrying:
>>>>
>>>> 2011-11-17 17:56:35.771574 7f704c834700 -- 192.168.42.113:0/2424 >>
>>>> 192.168.42.114:6802/21115 pipe(0x2596c80 sd=17 pgs=0 cs=0 l=0).connect
>>>> claims to be 192.168.42.114:6802/21507 not 192.168.42.114:6802/21115 -
>>>> wrong node!
>>>>
>>>> So osd.0 is basically refusing to talk to one of the other OSDs. I
>>>> don't understand the messenger well enough to know why this would be,
>>>> but it wouldn't surprise me if this problem kept the objects degraded
>>>> -- it looks like a breakage in the osd<->osd communication.
>>>>
>>>> Now if this was the reason, I'd expect a restart of all the OSDs to
>>>> get it back in shape; messenger state is ephemeral. Can you confirm
>>>> that?
>>>
>>> Probably not — that wrong node thing can occur for a lot of different
>>> reasons, some of which matter and most of which don't. Sam's looking
>>> into the problem; there's something going wrong with the CRUSH
>>> calculations or the monitor PG placement overrides or something...
>>> -Greg
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

