On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
> Hi, colleagues!
>
> I'm testing a simple Ceph cluster in order to use it in a production
> environment. I have 8 OSDs (1 TB SATA drives), which are evenly distributed
> across 4 nodes.
>
> I've mapped an RBD image on the client node and started writing a lot of
> data to it. Then I just reboot one node and see what happens. What happens
> is very sad: writes freeze for about 20-30 seconds, which is enough for the
> ext4 filesystem to switch to RO.
>
> Is there any way to minimize this lag? AFAIK, ext filesystems have a
> 5-second timeout before switching to RO. So is there any way to get that
> lag below 5 seconds? I've tried lowering different OSD timeouts, but it
> doesn't seem to help.
>
> How do you deal with such situations? 20 seconds of downtime is not
> tolerable in production.

What version of Ceph are you running, and how are you rebooting the node?
Any newish version that gets a clean reboot will notify the cluster that it's
shutting down, so you shouldn't see blocked writes at all. If you're doing a
reboot that just kills the daemon, you will have to wait through the timeout
period before the OSD gets marked down, which defaults to 30 seconds. This is
adjustable (look for docs on the "osd heartbeat grace" config option),
although if you set it too low you'll need to change a bunch of other
timeouts which I don't know off-hand...
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
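
For reference, a minimal ceph.conf sketch of the "osd heartbeat grace" option mentioned in the reply. The value shown is purely illustrative and is an assumption, not a tested recommendation. Placing it under [global] is also an assumption, made here so that both the monitors and the OSDs read the same value.

```ini
[global]
# Illustrative value only (an assumption, not a recommendation from the thread).
# Seconds without a heartbeat before a peer OSD is reported down.
osd heartbeat grace = 10
```

As the reply warns, setting this too low may require adjusting other timeouts as well. A clean shutdown of the OSD daemons avoids waiting for this timeout in the first place.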