Re: Write freeze when writing to rbd image and rebooting one of the nodes

Hi,

Well, I've found out that a clean stop of an OSD (/etc/init.d/ceph stop osd) causes no IO downtime. But that can hardly be called fault tolerance, which is what Ceph is supposed to provide.
However, "killall -9 ceph-osd" causes IO to stop for about 20 seconds.

I've tried lowering some timeouts, but without luck. Here is the relevant part of my ceph.conf after lowering the timeout values:

[global]
heartbeat interval = 5
mon osd down out interval = 90
mon pg warn max per osd = 2000
mon osd adjust heartbeat grace = false

[client]
rbd cache = false

[mon]
mon clock drift allowed = .200
mon osd min down reports = 1

[osd]
osd heartbeat interval = 3
osd heartbeat grace = 5
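
(To iterate faster, I believe the same values can also be pushed into running daemons with injectargs, e.g.:

ceph tell osd.* injectargs '--osd_heartbeat_grace 5 --osd_heartbeat_interval 3'

though I'm not sure every heartbeat option takes effect at runtime.)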

Can you help me reduce the IO downtime somehow? Twenty seconds is just horrible for production.

Regards, Vasily.


On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
Thanks, Gregory!

My Ceph version is 0.94.1. What I'm trying to test is the worst case, when a node loses its network or becomes unresponsive. So what I do is "killall -9 ceph-osd", then reboot.

Well, I also tried a clean reboot several times (just the "reboot" command), but I saw no difference - there is always an IO freeze of about 30 seconds. BTW, I'm using Fedora 20 on all nodes.
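(A simple way to see when the cluster actually notices the failure is to watch "ceph -w" or "ceph osd tree" during the test - the freeze should end roughly when the dead OSDs get marked down.)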

Ok, I will play with timeouts more.

Thanks again!

On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
> Hi, colleagues!
>
> I'm testing a simple Ceph cluster in order to use it in production
> environment. I have 8 OSDs (1TB SATA drives) evenly distributed
> across 4 nodes.
>
> I've mapped an rbd image on the client node and started writing a lot of data
> to it. Then I reboot one node and watch what happens, and what happens is
> very sad: writes freeze for about 20-30 seconds, which is enough for the
> ext4 filesystem to switch to read-only.
>
> I wonder if there is any way to minimize this lag? AFAIK, ext filesystems
> have a 5-second timeout before switching to read-only, so is there any way to
> get that lag under 5 seconds? I've tried lowering various osd timeouts, but it
> doesn't seem to help.
>
> How do you deal with such situations? 20 seconds of downtime is not
> tolerable in production.

What version of Ceph are you running, and how are you rebooting it?
Any newish version that gets a clean reboot will notify the cluster
that it's shutting down, so you shouldn't witness blocked writes at
all.

If you're doing a reboot that just kills the daemon, you
will have to wait through the timeout period before the OSD gets
marked down, which defaults to 30 seconds. This is adjustable (look
for docs on the "osd heartbeat grace" config option), although if you
set it too low you'll need to change a bunch of other timeouts which I
don't know off-hand...
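
As a rough sketch only (values purely illustrative, not a recommendation), the relevant bits of ceph.conf look something like this - and note the grace has to be visible to the monitors too, so setting it in [global] (or in both [osd] and [mon]) is the usual advice:

[global]
osd heartbeat grace = 10

[osd]
osd heartbeat interval = 3

You can check what a running daemon actually ended up with via its admin socket:

ceph daemon osd.0 config show | grep heartbeat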
-Greg


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
