On 22/03/15 at 10:55, Saverio Proto wrote:
> Hello,
>
> I started working with Ceph a few weeks ago, so I might be asking a very
> newbie question, but I could not find an answer in the docs or in the
> mailing list archive for this.
>
> Quick description of my setup:
> I have a Ceph cluster with two servers. Each server has 3 SSD drives that
> I use for journals only. To place SAS disks that keep their journal on the
> same SSD drive into different failure domains, I wrote my own crushmap.
> I now have a total of 36 OSDs. Ceph health returns HEALTH_OK.
> I run the cluster with a couple of pools with size=3 and min_size=3.
>
> Production operations questions:
> I manually stopped some OSDs to simulate a failure.
>
> As far as I understand, an "OSD down" condition is not enough to make
> Ceph start making new copies of objects. I noticed that I must mark the
> OSD as "out" to make Ceph produce new copies.
> As far as I understand, min_size=3 puts the object into read-only mode if
> there are not at least 3 copies of the object available.
>
> Is this behavior correct, or did I make some mistake creating the cluster?
> Should I expect Ceph to automatically produce a new copy of objects when
> some OSDs are down?
> Is there any option to automatically mark "out" OSDs that go "down"?
>
> thanks
>
> Saverio

Hi,

you should set this parameter in the [mon] section of your Ceph config file:

mon_osd_down_out_interval = 900

It sets the interval (in seconds) that Ceph waits after an OSD is detected as down before marking it out and starting to make new copies. By default it is set to 600 seconds.

BR,
Xabier
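P.S. A minimal sketch of how that looks in ceph.conf (assuming the standard /etc/ceph/ceph.conf layout; the 900-second value is just the example above, not a recommendation):

    [mon]
    # Wait 900 seconds after an OSD is reported down before automatically
    # marking it out, which triggers re-replication of its placement groups.
    mon_osd_down_out_interval = 900

The monitors read this at startup, so restart them after editing the file. To change it on a running cluster instead, the generic injectargs mechanism should work, e.g.:

    ceph tell mon.* injectargs '--mon-osd-down-out-interval 900'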