Re: Cluster sync doesn't finish

Hi Sam,
Is there anything new on this issue that I could test?

-martin


On 19.11.2011 02:05, Samuel Just wrote:
I've filed this bug as #1738.  Unfortunately, this will take a bit
of effort to fix.  In the short term, you could switch to a crushmap
where each node at the bottom level of the hierarchy contains more
than one device.  (i.e., remove the node level and stop at the rack
level).

Thanks for the help!
-Sam
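Sam's workaround, sketched as a CRUSH map text fragment (the bucket and device names here are hypothetical, not taken from Martin's actual map): instead of a per-node bucket holding a single device, the rack bucket holds the devices directly, so the lowest level of the hierarchy contains more than one device:

```
# Hypothetical rack bucket with the intermediate per-node level removed;
# the OSD devices are items of the rack bucket itself.
rack rack0 {
        id -2                    # internal bucket id: negative, unique
        alg straw                # straw buckets rebalance well on weight change
        hash 0                   # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
```

The edited text map can be recompiled with crushtool and injected back into the cluster.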

On Fri, Nov 18, 2011 at 12:17 PM, Martin Mailand <martin@xxxxxxxxxxxx> wrote:
Hi Sam,

here the crushmap

http://85.214.49.87/ceph/crushmap.txt
http://85.214.49.87/ceph/crushmap

-martin

Samuel Just wrote:

It looks like a crushmap-related problem.  Could you send us the crushmap?

ceph osd getcrushmap

Thanks
-Sam

On Fri, Nov 18, 2011 at 10:13 AM, Gregory Farnum
<gregory.farnum@xxxxxxxxxxxxx>  wrote:

On Fri, Nov 18, 2011 at 10:05 AM, Tommi Virtanen
<tommi.virtanen@xxxxxxxxxxxxx>  wrote:

On Thu, Nov 17, 2011 at 12:48, Martin Mailand <martin@xxxxxxxxxxxx>
wrote:

Hi,
I am doing a cluster failure test, where I shut down one OSD and wait for
the cluster to sync. But the sync never finishes; it stops at around 4-5%.
I stopped osd2.

...

2011-11-17 16:42:45.520740    pg v1337: 600 pgs: 547 active+clean, 53
active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB
avail;
4025/82404 degraded (4.884%)

...

The osd log, ceph.conf, pg dump, and osd dump can be found here.

http://85.214.49.87/ceph/

This looks a bit worrying:

2011-11-17 17:56:35.771574 7f704c834700 -- 192.168.42.113:0/2424 >>
192.168.42.114:6802/21115 pipe(0x2596c80 sd=17 pgs=0 cs=0 l=0).connect
claims to be 192.168.42.114:6802/21507 not 192.168.42.114:6802/21115 -
wrong node!

So osd.0 is basically refusing to talk to one of the other OSDs. I
don't understand the messenger well enough to know why this would be,
but it wouldn't surprise me if this problem kept the objects degraded
-- it looks like a breakage in the osd<->osd communication.

Now if this was the reason, I'd expect a restart of all the OSDs to
get it back in shape; messenger state is ephemeral. Can you confirm
that?

Probably not — that wrong node thing can occur for a lot of different
reasons, some of which matter and most of which don't. Sam's looking
into the problem; there's something going wrong with the CRUSH
calculations or the monitor PG placement overrides or something...
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


