Re: Disk Down Emergency


 



I would like to thank you all very much for your assistance, support, and time.

I totally agree with you regarding the number of replicas, and this is probably the best time to switch to 3 replicas, since all services have been stopped due to this emergency.
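
(For reference, my understanding is that the switch itself is just a per-pool setting, roughly the following for each pool, with the pool name being only a placeholder:

   ceph osd pool set <poolname> size 3
   ceph osd pool set <poolname> min_size 2

after which the cluster backfills the extra copies.)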

After I removed the OSD from CRUSH, the cluster started backfilling, which finished successfully. I should mention that before removing the OSD from CRUSH I had stopped scrubbing, and I re-enabled it as soon as the backfilling finished successfully.
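
(For anyone following along: scrubbing can be paused and resumed cluster-wide with the scrub flags, e.g.

   ceph osd set noscrub
   ceph osd set nodeep-scrub

before the removal, and the corresponding "ceph osd unset ..." commands once the backfilling has finished.)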

So the cluster is now scrubbing again, and I am wondering whether I should let it finish scrubbing (or even issue a deep scrub) before changing the physical disk, or whether I should replace the disk as soon as possible before taking any further action.
What would you do?

Best,

G.


The first step is to make sure that it is out of the cluster.  Does
`ceph osd stat` show the same number of OSDs as "in" (it's the same as a
line from `ceph status`)?  It should show one less for "up", but if it's
still registering the OSD as "in" then the backfilling won't start.
`ceph osd out 0` should mark it out and let the backfilling start.
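
In other words, roughly this, assuming the failed OSD really is osd.0:

   ceph osd stat     # compare the "up" and "in" counts against the total
   ceph osd out 0    # only needed if osd.0 is still counted as "in"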

If it's already out, then `ceph osd crush remove osd.0; ceph auth del
osd.0; ceph osd rm 0` will finish removing it from the cluster and let
you move forward.  It's a good idea to wait to run these commands until
you have a full copy of your data again, so really try to let the
cluster do its thing if you can.  What generally happens, and what
everyone (including myself) is recommending for you, is to let the OSD
get marked down (which it has done) and then marked out.  Once it is
marked out, the cluster will rebalance and make sure that all of the
data that is on the out OSD is replicated to have the full number of
copies again.  My guess is that you just have a setting somewhere
preventing the OSD from being marked out.
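
A couple of places where such a setting could hide (the admin socket path here assumes the default location and the mon.store1 name from your ceph.conf):

   ceph osd dump | grep flags
       # a cluster-wide "noout" flag listed here would prevent marking out
   ceph --admin-daemon /var/run/ceph/ceph-mon.store1.asok config get mon_osd_down_out_interval
       # a value of 0 means OSDs are never marked out automatically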

As far as your customer not understanding that 2 replicas is bad for
production data: write up a contract that they have to sign
indemnifying you of any responsibility if they lose data, because you
have warned them to have 3 replicas.  If they don't sign it, then tell
them you will no longer manage Ceph for them.  Hopefully they wake up
and make everyone's job easier by purchasing a third server.

On Thu, Nov 16, 2017 at 9:26 AM Georgios Dimitrakakis  wrote:

 Thank you all for your time and support.

 I don't see any backfilling in the logs, and the number of
 "active+degraded", "active+remapped", and "active+clean"
 objects has been the same for some time now. The only thing I see is
 "scrubbing".

 Wido, I cannot do anything with the data on osd.0: although the
 failed disk seems mounted, I cannot see anything and I am getting an
 "Input/output" error.

 So I guess the right action for now is to remove the OSD by issuing
 "ceph osd crush remove osd.0", as Sean suggested, correct?

 G.

>> On 16 November 2017 at 14:46, Caspar Smit wrote:
>>
>>
>> 2017-11-16 14:43 GMT+01:00 Wido den Hollander :
>>
>> >
>> > > On 16 November 2017 at 14:40, Georgios Dimitrakakis <giorgis@xxxxxxxxxxxx> wrote:
>> > >
>> > >
>> > >  @Sean Redmond: No, I don't have any unfound objects. I only have
>> > >  "stuck unclean" with "active+degraded" status.
>> > >  @Caspar Smit: The cluster is scrubbing ...
>> > >
>> > >  @All: My concern is that only one copy is left for the data on
>> > >  the failed disk.
>> > >
>> >
>> > Let the Ceph recovery do its work. Don't do anything manually now.
>> >
>> >
>> @Wido, I think his cluster might have stopped recovering because of
>> non-optimal tunables in Firefly.
>>
>
> Ah, darn. Yes, that was a long time ago. Could very well be the
> case.
>
> He could try to remove osd.0 from the CRUSH map and let recovery
> progress.
>
> I would however advise him not to fiddle with the data on osd.0. Do
> not try to copy the data somewhere else and try to fix the OSD.
>
> Wido
>
>>
>> > >  If I just remove osd.0 from the CRUSH map, does that copy all
>> > >  of its data from the only available copy to the rest of the
>> > >  unaffected disks, which will consequently end up with two copies
>> > >  on two different hosts again?
>> > >
>> >
>> > Do NOT copy the data from osd.0 to another OSD. Let the Ceph
>> > recovery handle this.
>> >
>> > It is already marked as out and within 24 hours or so recovery
>> > will have finished.
>> >
>> > But a few things:
>> >
>> > - Firefly 0.80.9 is old
>> > - Never, never, never run with size=2
>> >
>> > Not trying to scare you, but it's a reality.
>> >
>> > Now let Ceph handle the rebalance and wait.
>> >
>> > Wido
>> >
>> > >  Best,
>> > >
>> > >  G.
>> > >
>> > >
>> > > > 2017-11-16 14:05 GMT+01:00 Georgios Dimitrakakis :
>> > > >
>> > > >> Dear cephers,
>> > > >>
>> > > >> I have an emergency on a rather small ceph cluster.
>> > > >>
>> > > >> My cluster consists of 2 OSD nodes with 10 x 4TB disks each
>> > > >> and 3 monitor nodes.
>> > > >>
>> > > >> The version of ceph running is Firefly v.0.80.9
>> > > >> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>> > > >>
>> > > >> The cluster was originally built with "Replicated size=2" and
>> > > >> "Min size=1" with the attached CRUSH map, which to my
>> > > >> understanding replicates data across hosts.
>> > > >>
>> > > >> The emergency comes from the violation of the golden rule:
>> > > >> "Never use 2 replicas on a production cluster".
>> > > >>
>> > > >> Unfortunately the customers never really understood the risk
>> > > >> well, and now that one disk is down I am in the middle and I
>> > > >> must do everything in my power not to lose any data, thus I am
>> > > >> requesting your assistance.
>> > > >>
>> > > >> Here is the output of
>> > > >>
>> > > >> $ ceph osd tree
>> > > >> # id    weight  type name       up/down reweight
>> > > >> -1      72.6    root default
>> > > >> -2      36.3            host store1
>> > > >> 0       3.63                    osd.0   down    0    ---> DISK DOWN
>> > > >> 1       3.63                    osd.1   up      1
>> > > >> 2       3.63                    osd.2   up      1
>> > > >> 3       3.63                    osd.3   up      1
>> > > >> 4       3.63                    osd.4   up      1
>> > > >> 5       3.63                    osd.5   up      1
>> > > >> 6       3.63                    osd.6   up      1
>> > > >> 7       3.63                    osd.7   up      1
>> > > >> 8       3.63                    osd.8   up      1
>> > > >> 9       3.63                    osd.9   up      1
>> > > >> -3      36.3            host store2
>> > > >> 10      3.63                    osd.10  up      1
>> > > >> 11      3.63                    osd.11  up      1
>> > > >> 12      3.63                    osd.12  up      1
>> > > >> 13      3.63                    osd.13  up      1
>> > > >> 14      3.63                    osd.14  up      1
>> > > >> 15      3.63                    osd.15  up      1
>> > > >> 16      3.63                    osd.16  up      1
>> > > >> 17      3.63                    osd.17  up      1
>> > > >> 18      3.63                    osd.18  up      1
>> > > >> 19      3.63                    osd.19  up      1
>> > > >>
>> > > >> and here is the status of the cluster
>> > > >>
>> > > >> # ceph health
>> > > >> HEALTH_WARN 497 pgs degraded; 549 pgs stuck unclean;
>> > > >> recovery 51916/2552684 objects degraded (2.034%)
>> > > >>
>> > > >> Although OSD.0 is shown as mounted, it cannot be started
>> > > >> (probably a failed disk controller problem).
>> > > >>
>> > > >> # df -h
>> > > >> Filesystem      Size  Used Avail Use% Mounted on
>> > > >> /dev/sda3       251G  4.1G  235G   2% /
>> > > >> tmpfs            24G     0   24G   0% /dev/shm
>> > > >> /dev/sda1       239M  100M  127M  44% /boot
>> > > >> /dev/sdj1       3.7T  223G  3.5T   6% /var/lib/ceph/osd/ceph-8
>> > > >> /dev/sdh1       3.7T  205G  3.5T   6% /var/lib/ceph/osd/ceph-6
>> > > >> /dev/sdg1       3.7T  199G  3.5T   6% /var/lib/ceph/osd/ceph-5
>> > > >> /dev/sde1       3.7T  180G  3.5T   5% /var/lib/ceph/osd/ceph-3
>> > > >> /dev/sdi1       3.7T  187G  3.5T   6% /var/lib/ceph/osd/ceph-7
>> > > >> /dev/sdf1       3.7T  193G  3.5T   6% /var/lib/ceph/osd/ceph-4
>> > > >> /dev/sdd1       3.7T  212G  3.5T   6% /var/lib/ceph/osd/ceph-2
>> > > >> /dev/sdk1       3.7T  210G  3.5T   6% /var/lib/ceph/osd/ceph-9
>> > > >> /dev/sdb1       3.7T  164G  3.5T   5% /var/lib/ceph/osd/ceph-0   ---> This is the problematic OSD
>> > > >> /dev/sdc1       3.7T  183G  3.5T   5% /var/lib/ceph/osd/ceph-1
>> > > >>
>> > > >> # service ceph start osd.0
>> > > >> find: `/var/lib/ceph/osd/ceph-0': Input/output error
>> > > >> /etc/init.d/ceph: osd.0 not found (/etc/ceph/ceph.conf defines
>> > > >> mon.store1 osd.6 osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7
>> > > >> mds.store1 mon.store3, /var/lib/ceph defines mon.store1 osd.6
>> > > >> osd.9 osd.1 osd.4 osd.3 osd.2 osd.8 osd.5 osd.7 mds.store1)
>> > > >>
>> > > >> I have found this:
>> > > >>
>> > > >> http://ceph.com/geen-categorie/admin-guide-replacing-a-failed-disk-in-a-ceph-cluster/
>> > > >>
>> > > >> and I am looking for your guidance in order to properly perform
>> > > >> all actions so as not to lose any data and to keep the data of
>> > > >> the second copy.
>> > > >
>> > > > What guidance are you looking for besides the steps to replace a
>> > > > failed disk (which you already found)?
>> > > > If I look at your situation, there is nothing down in terms of
>> > > > availability of PGs, just a failed drive which needs to be
>> > > > replaced.
>> > > >
>> > > > Is the cluster still recovering? It should reach HEALTH_OK again
>> > > > after rebalancing the cluster when an OSD goes down.
>> > > >
>> > > > If it stopped recovering, it might have to do with the Ceph
>> > > > tunables, which are not set to optimal by default on Firefly and
>> > > > that prevents further rebalancing.
>> > > > WARNING: Don't just set tunables to optimal, because it will
>> > > > trigger a massive rebalance!
>> > > >
>> > > > Perhaps the second golden rule is to never run a Ceph production
>> > > > cluster without knowing (and testing) how to replace a failed
>> > > > drive. (I'm not trying to be harsh here.)
>> > > >
>> > > > Kind regards,
>> > > > Caspar
>> > > >
>> > > >
>> > > >> Best regards,
>> > > >>
>> > > >> G.

--
 Dr. Dimitrakakis Georgios

 Networks and Systems Administrator

 Archimedes Center for Modeling, Analysis & Computation (ACMAC)
 School of Sciences and Engineering
 University of Crete
 P.O. Box 2208
 710 - 03 Heraklion
 Crete, Greece

 Tel: +30 2810 393717
 Fax: +30 2810 393660

 E-mail: giorgis@xxxxxxxxxxxx


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



