Thanks, Sage!
In the meantime I asked the same question in the #Ceph IRC channel, and Be_El gave me exactly the same answer, which helped.
I also realized that the docs at http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ state: "You may change this grace period by adding an osd heartbeat grace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." But in reality you must add this option to the [global] section: setting it in the [osd] section only influenced the OSD daemons, not the monitors.
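For anyone else who hits this, the relevant part of my ceph.conf now looks roughly like this (I'm reproducing it from memory, so treat it as a sketch; the values are the ones discussed below in the thread, adjust them to your own needs):

[global]
# keep the heartbeat settings under [global] so that both the OSDs
# and the monitors pick them up
osd heartbeat interval = 3
osd heartbeat grace = 5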
Anyway, now IO resumes after only 5 seconds freeze. Thanks for help, guys!
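For the record, the runtime route Sage mentions below should also work without restarting anything - something along these lines (the daemon names mon.a and osd.0 are just examples, and the admin-socket check is my assumption about how to verify the running value, so please double-check it on your version):

# push the new values to all running OSDs
ceph tell osd.\* injectargs '--osd-heartbeat-grace 5 --osd-heartbeat-interval 3'
# the monitors consult the grace value too, so inject it there as well
ceph tell mon.a injectargs '--osd-heartbeat-grace 5'
# confirm what a daemon is actually running with, via its admin socket
ceph daemon osd.0 config get osd_heartbeat_grace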
Regarding Ceph failure detection: in a real environment, 20-30 seconds of IO freeze after a single storage node outage seems very expensive to me.
Even with data consistency in mind, 5 seconds looks like an acceptable threshold.
But, Sage, can you briefly explain the drawbacks of lowering the timeout? If, for example, I have a stable 10 gig cluster network which is unlikely to lag or drop - is 5 seconds dangerous in any way? How could OSDs report false positives in that case?
Thanks in advance :)
On Wed, May 13, 2015 at 7:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Wed, 13 May 2015, Vasiliy Angapov wrote:
> Hi,
>
> Well, I've managed to find out that a clean stop of an OSD causes no IO
> downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> tolerance, which Ceph is supposed to provide. However, "killall -9 ceph-osd"
> causes IO to stop for about 20 seconds.
>
> I've tried lowering some timeouts but without luck. Here is a related part
> of my ceph.conf after lowering the timeout values:
>
> [global]
> heartbeat interval = 5
> mon osd down out interval = 90
> mon pg warn max per osd = 2000
> mon osd adjust heartbeat grace = false
>
> [client]
> rbd cache = false
>
> [mon]
> mon clock drift allowed = .200
> mon osd min down reports = 1
>
> [osd]
> osd heartbeat interval = 3
> osd heartbeat grace = 5
>
> Can you help me to reduce IO downtime somehow? Because 20 seconds for
> production is just horrible.
You'll need to restart ceph-osd daemons for that change to take effect, or
ceph tell osd.\* injectargs '--osd-heartbeat-grace 5 --osd-heartbeat-interval 1'
Just remember that this timeout is a tradeoff against false positives--be
careful tuning it too low.
Note that ext4 going ro after 5 seconds sounds like insanity to me. I've
only seen this with older guest kernels, and iirc the problem is a
120s timeout with ide or something?
Ceph is a CP system that trades availability for consistency--it will
block IO as needed to ensure that it is handling reads or writes in a
completely consistent manner. Even if you get the failure detection
latency down, other recovery scenarios are likely to cross the magic 5s
threshold at some point and cause the same problem. You need to fix your
guests one way or another!
sage
>
> Regards, Vasily.
>
>
> On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
> Thanks, Gregory!
> My Ceph version is 0.94.1. What I'm trying to test is the worst
> situation, when the node loses network or becomes unresponsive. So
> what I do is "killall -9 ceph-osd", then reboot.
>
> Well, I also tried to do a clean reboot several times (just a "reboot"
> command), but I saw no difference - there is always an IO freeze for
> about 30 seconds. Btw, I'm using Fedora 20 on all nodes.
>
> Ok, I will play with timeouts more.
>
> Thanks again!
>
> On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum <greg@xxxxxxxxxxx>
> wrote:
> On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov
> <angapov@xxxxxxxxx> wrote:
> > Hi, colleagues!
> >
> > I'm testing a simple Ceph cluster in order to use it in a production
> > environment. I have 8 OSDs (1Tb SATA drives) which are evenly
> > distributed between 4 nodes.
> >
> > I've mapped an rbd image on the client node and started writing a lot
> > of data to it. Then I just reboot one node and see what happens. What
> > happens is very sad: I have a write freeze for about 20-30 seconds,
> > which is enough for the ext4 filesystem to switch to RO.
> >
> > I wonder if there is any way to minimize this lag? AFAIK, ext
> > filesystems have a 5 second timeout before switching to RO. So is
> > there any way to get that lag below 5 secs? I've tried lowering
> > different osd timeouts, but it doesn't seem to help.
> >
> > How do you deal with such situations? 20 seconds of downtime is not
> > tolerable in production.
>
> What version of Ceph are you running, and how are you rebooting it?
> Any newish version that gets a clean reboot will notify the cluster
> that it's shutting down, so you shouldn't witness blocked writes
> really at all.
>
> If you're doing a reboot that involves just ending the daemon, you
> will have to wait through the timeout period before the OSD gets
> marked down, which defaults to 30 seconds. This is adjustable (look
> for docs on the "osd heartbeat grace" config option), although if you
> set it too low you'll need to change a bunch of other timeouts which I
> don't know off-hand...
> -Greg
>
>
>
>
>