MDS spinning wild after restart on all nodes

Amon Ott <a.ott@xxxxxxxxxxxx> · Fri, 15 Jun 2012 11:43:25 +0200

Hello all,

I have seen this for a long time but never investigated further. After several
days of stable test runs, this is our last known showstopper before using Ceph
in production. We are running 0.47.2 on 32-bit.

If we restart the MDS (or all Ceph daemons) on all nodes, either one after
another or all together, they first recover. Then the active one starts to spin
at full CPU and stops responding. After a while the next one takes over, starts
to spin, and so on, until the whole cluster is unusable. This is completely
reproducible and happens even without any active client.

As expected, ceph -w shows lots of
"2012-06-15 11:35:28.588775   mds e959: 1/1/1 up {0=3=up:active(laggy or crashed)}"
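To spot the failover cascade while watching `ceph -w`, a small filter can flag every mdsmap epoch in which a rank is reported as laggy or crashed. This is a minimal sketch, assuming the mdsmap line format matches the single line quoted above; the regex, the helper name `laggy_ranks`, and the handling of multiple ranks (assumed to be separated by ", ") are hypothetical and not taken from Ceph itself.

```python
import re
import sys

# Assumed format, inferred only from the one mdsmap line quoted above:
#   "<date> <time>   mds e<epoch>: <up>/<in>/<max> up {<rank>=<name>=<state>, ...}"
MDS_LINE = re.compile(r"mds e(?P<epoch>\d+): \S+ up \{(?P<ranks>[^}]*)\}")


def laggy_ranks(line):
    """Return (epoch, [(rank, name, state), ...]) for ranks marked laggy or
    crashed, or None if the line is not an mdsmap status line."""
    m = MDS_LINE.search(line)
    if m is None:
        return None
    flagged = []
    # Hypothetical: multiple ranks are assumed to be separated by ", ".
    for entry in m.group("ranks").split(", "):
        if not entry:
            continue  # e.g. "{}" when no rank is up
        rank, name, state = entry.split("=", 2)
        if "laggy" in state or "crashed" in state:
            flagged.append((rank, name, state))
    return int(m.group("epoch")), flagged


if __name__ == "__main__":
    # Usage: ceph -w | python3 watch_mds.py
    for raw in sys.stdin:
        result = laggy_ranks(raw)
        if result and result[1]:
            print("epoch %d: %s" % result, flush=True)
```

Printing only the epochs with a flagged rank makes it easy to see which daemon took over next and how quickly it, too, went laggy.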

It does not help to stop all services on all nodes for minutes or longer and
then restart them: the MDS starts spinning again. But if we reboot the whole
cluster, everything works again.

Today's MDS log is available at 
https://download.m-privacy.de/homeuser-mds.0.log.gz

Is this a known problem? It has been with us for a looong time now, but since 
rebooting used to help, we never tracked it down.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649