Hello, thanks for the answers.

This was exactly what I was looking for:

mon_osd_down_out_interval = 900

I was not waiting long enough to see my cluster recover by itself.
That's why I tried to increase min_size: I did not understand what
min_size was for. Now that I know what min_size does, I guess the best
setting for me is min_size = 1, because I would like to be able to
perform I/O operations even if only 1 copy is left.

Thanks to all for helping!

Saverio

2015-03-23 14:58 GMT+01:00 Gregory Farnum <greg@xxxxxxxxxxx>:
> On Sun, Mar 22, 2015 at 2:55 AM, Saverio Proto <zioproto@xxxxxxxxx> wrote:
>> Hello,
>>
>> I started working with Ceph a few weeks ago, so I might be asking a
>> very newbie question, but I could not find an answer in the docs or
>> in the ML archive.
>>
>> Quick description of my setup:
>> I have a Ceph cluster with two servers. Each server has 3 SSD drives
>> used for journals only. To map SAS disks that journal to the same
>> SSD drive into different failure domains, I wrote my own crushmap.
>> I now have a total of 36 OSDs, and "ceph health" returns HEALTH_OK.
>> I run the cluster with a couple of pools with size=3 and min_size=3.
>>
>> Production operations questions:
>> I manually stopped some OSDs to simulate a failure.
>>
>> As far as I understood, an "OSD down" condition is not enough to
>> make Ceph start making new copies of objects. I noticed that I must
>> mark the OSD as "out" to make Ceph produce new copies.
>> As far as I understood, min_size=3 puts objects in read-only mode if
>> there are not at least 3 copies of each object available.
>
> That is correct, but the default min_size with size 3 is 2, and you
> probably want to use that instead. If you have size == min_size on
> Firefly releases and lose an OSD, the cluster can't do recovery, so
> that PG is stuck without manual intervention. :( This is because of
> some quirks in how OSD peering and recovery work, so you'd be
> forgiven for thinking it would recover nicely.
> (This is changed in the upcoming Hammer release, but you probably
> still want to allow cluster activity when an OSD fails, unless you're
> very confident in their uptime and more concerned about durability
> than availability.)
> -Greg
>
>> Is this behavior correct, or did I make some mistake creating the
>> cluster? Should I expect Ceph to automatically produce new copies of
>> objects when some OSDs are down?
>> Is there any option to automatically mark "out" OSDs that go "down"?
>>
>> thanks
>>
>> Saverio
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
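For reference, the settings discussed in this thread could be sketched as a ceph.conf fragment like the one below; the 900-second value is the one Saverio mentions, and the section placement is the conventional one (adjust to your cluster):

```ini
[mon]
# Automatically mark a "down" OSD "out" after 900 seconds.
# Once the OSD is "out", Ceph starts re-replicating its objects
# onto the remaining OSDs.
mon osd down out interval = 900
```

The per-pool replication levels are set at runtime rather than in ceph.conf, e.g. `ceph osd pool set <poolname> size 3` and `ceph osd pool set <poolname> min_size 2` (where `<poolname>` stands in for an actual pool).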