Ceph crash, how to analyse and recover

Hello Ceph Users,

We have a Ceph test cluster that we want to bring into production; it will grow rapidly in the future.
Ceph version:
ceph          0.80.7-2+deb8u1   amd64   distributed storage and file system
ceph-common   0.80.7-2+deb8u1   amd64   common utilities to mount and interact with a ceph storage cluster


Our config: 5 hosts, each running 12 OSDs; the cluster contains 2 objects.

One node went down and stayed down for about 12 hours. After it was
brought back online (manually), the entire cluster slowly ground to a
halt. The status is:

First status after this crash:

    cluster e2295d66-a265-11e5-8c92-00219bfd424c
     health HEALTH_WARN 4628 pgs down; 4628 pgs peering; 4628 pgs stuck inactive; 4628 pgs stuck unclean
     monmap e3: 3 mons at {a=172.30.0.2:6789/0,b=172.30.0.67:6789/0,mon=172.30.0.1:6789/0}, election epoch 16, quorum 0,1,2 mon,a,b
     osdmap e18880: 60 osds: 48 up, 48 in
      pgmap v127495: 4628 pgs, 4 pools, 1238 bytes data, 4 objects
            283 GB used, 130 TB / 130 TB avail
                4628 down+peering
 
The Ceph status at this moment:
# ceph status
    cluster e2295d66-a265-11e5-8c92-00219bfd424c
     health HEALTH_WARN 4622 pgs down; 4628 pgs peering; 1427 pgs stale; 4628 pgs stuck inactive; 1427 pgs stuck stale; 4628 pgs stuck unclean; 2/17 in osds are down; 1 mons down, quorum 1,2 a,b
     monmap e3: 3 mons at {a=172.30.0.2:6789/0,b=172.30.0.67:6789/0,mon=172.30.0.1:6789/0}, election epoch 18, quorum 1,2 a,b
     osdmap e19242: 60 osds: 15 up, 17 in
      pgmap v128135: 4628 pgs, 4 pools, 118 bytes data, 3 objects
            100 GB used, 47383 GB / 47483 GB avail
                   3 peering
                1424 stale+down+peering
                3198 down+peering
                   3 stale+peering

   

It is a test cluster, so no real harm done. How can we get it back up,
and why did this happen?
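For anyone reading this in the archives: a first-pass diagnosis for stuck/down PGs like the above could look like the sketch below. These are standard ceph CLI calls available in Firefly (0.80); the PG id 0.1 and osd.0 are placeholders, so substitute ids reported by your own cluster.

```shell
# Show which PGs are unhealthy and why (per-PG detail)
ceph health detail

# Confirm which OSDs the monitors consider up/in
ceph osd tree

# List the PGs that are stuck inactive
ceph pg dump_stuck inactive

# Ask one affected PG why it cannot peer; "0.1" is a
# placeholder -- use an id reported by dump_stuck
ceph pg 0.1 query

# If the OSD daemons on the recovered node never came back,
# start them again (Debian sysvinit, as in this 0.80 setup);
# "osd.0" is a placeholder for an OSD on that host
service ceph start osd.0
```

The `ceph pg <pgid> query` output in particular lists the OSDs the PG is blocked waiting on, which usually points at the daemons that need restarting.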


Regards, Arnoud.


This message may contain confidential information and is intended exclusively for the addressee. If you receive this message unintentionally, please do not use the contents but notify the sender immediately by return e-mail. University Medical Center Utrecht is a legal person by public law and is registered at the Chamber of Commerce for Midden-Nederland under no. 30244197.

Please consider the environment before printing this e-mail.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
