Re: Write freeze when writing to rbd image and rebooting one of the nodes

Vasiliy Angapov <angapov@xxxxxxxxx> · Thu, 14 May 2015 00:32:35 +0400

Robert, thank you very much for sharing your wisdom with me! Much appreciated.
I think I more or less got your point. Ceph is not a SAN; that sounds logical.
What I'm trying to understand is what Ceph is for and what it is not for... Is there any article about that? :)

I've heard Ceph is enterprise storage, but spending 20 seconds waiting for IO doesn't sound very enterprise.
Does that mean that I should really switch back to local storage systems (or expensive SDS like Nutanix) and forget about Ceph?
Look, I'm not criticizing, I just want to get things clear.

I know that the network is crucial for Ceph. I can accept that; I can build a rather expensive but reliable cluster network. Even Infiniband if required.
I know that Ceph is not good with small clusters of 10 OSDs or fewer. I can buy more hosts and more disks, this is not a problem.
I know that commodity hardware is not very reliable. I can buy nice servers, nice controllers, nice disks and NVMe SSDs.
Given all of that, will it be enough to keep Ceph from falling over with a 5-second OSD grace interval under high load?

Or should I go somewhere else with my enterprise wishes?

For now, I have a rather small test cluster with several SATA drives and SSDs, so my real concern is: should I say to my boss, "OK, this seems to be working fine, let's go ahead then"?
What do you think, colleagues? I would be very grateful for any shared knowledge.

On Wed, May 13, 2015 at 11:21 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Don't let me deter you from pointing a gun at your foot. I think you may be confusing Ceph with a SAN, which it certainly is not. In an expensive SAN, 5 seconds is unacceptable for a failover event between controllers. However, those systems have many safeguards in place to prevent corruption, using custom communication channels and protocols to fence misbehaving controllers.

Ceph is a distributed storage system built on commodity hardware and is striving to fill a different role. In a distributed system you have to reach a consensus on the state of each member to know what is really going on. Since there are more external forces in a distributed system, it usually can't be as quick to make decisions. Since the direct communication channels don't exist, and things like flaky network cables, cards, loops, etc. can all impact the ability of the cluster to have a good idea of the state of each node, you usually have to wait some time. This is partly due to the protocols used, such as TCP, which has certain time-out periods.

Recovery in Ceph is also more expensive than in a SAN. SANs rely on controllers being able to access the same disks on the backend, whereas Ceph replicates the data to provide availability. This can stress even very high bandwidth networks, disks, CPUs, etc. You don't want to jump into recovery too quickly, as it can cause a cascading effect.

This gets to answering your question directly. If you have very short timing for detecting failures, then you can get into a feedback loop that takes your whole cluster down so that it can't get healthy. If a node fails and recovery starts, other nodes may be too busy to respond to heartbeats fast enough, or the heartbeats are lost in transit. Those nodes then start getting marked down incorrectly, and another round of recovery starts, stressing the remaining OSDs and causing a downward spiral. At the same time, OSDs that were wrongly marked down start responding and are brought back into the cluster, the recovery restarts, and then you get this constant OSD flapping.
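
The false-positive effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the arrival times and the "down after `grace` seconds of silence" rule are made up for illustration and are not Ceph's actual heartbeat or monitor logic, which also involves peer reports and reporter thresholds.

```python
def false_down_marks(arrival_times, grace):
    """Return the times at which a peer would be declared down.

    arrival_times: sorted timestamps (seconds) at which heartbeats from one
    peer were actually received.
    grace: seconds of silence tolerated before the peer is declared down.

    Hypothetical model for illustration only, not Ceph's implementation.
    """
    marks = []
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev > grace:
            # The peer stayed silent longer than the grace period.
            marks.append(prev + grace)
    return marks


# A healthy but busy OSD: heartbeats normally arrive every 3 s, but during a
# recovery burst one of them is delayed, so one gap stretches to 8 s.
arrivals = [0, 3, 6, 14, 17, 20]

print(false_down_marks(arrivals, grace=5))   # short grace: wrongly marked down
print(false_down_marks(arrivals, grace=20))  # longer grace: delay is tolerated
```

With the short grace period the OSD is declared down even though it never failed, which in a real cluster would trigger exactly the extra round of recovery described above.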

While 30 seconds seems like a long time, it is certainly better than being offline for a month due to corrupted data (true story, I have the scars to prove it). I make it a habit for any virtualized machines to increase the timeout of the file system to 300 seconds to help weather the storms that may happen anywhere in the pipe between the VM OS -> hypervisor -> storage system. 30 seconds is well within the time that the typical end user will just punch the refresh button if they are impatient and the page will load pretty quick after that and they won't think twice.
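
One way such a guest-side timeout is commonly raised is the per-disk SCSI command timeout exposed in sysfs. The sketch below assumes that this is the knob in question (the message does not name it), that the guest disks show up as `sd*` devices with a `device/timeout` file, and that it runs as root. The function name and the `sys_block` parameter are hypothetical, and the setting does not persist across reboots.

```python
from pathlib import Path


def set_disk_timeouts(seconds, sys_block=Path("/sys/block")):
    """Write `seconds` into every sd*/device/timeout file under sys_block.

    Returns {device_name: (old_value, new_value)} for each disk changed.
    Assumption: SCSI-style disks (sd*). Devices without a device/timeout
    file, such as virtio-blk vd* disks, are simply skipped.
    """
    changed = {}
    for timeout_file in sorted(Path(sys_block).glob("sd*/device/timeout")):
        old = timeout_file.read_text().strip()
        timeout_file.write_text(f"{seconds}\n")
        # Path layout is <sys_block>/<device>/device/timeout.
        changed[timeout_file.parts[-3]] = (old, str(seconds))
    return changed
```

Inside a guest, `set_disk_timeouts(300)` would apply the 300-second value mentioned above to every matching disk and report the previous values.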

However, if you really need 5 seconds or less for failover of unexpected failures, I would suggest that you reevaluate the things that are important to you and the trade-offs you are willing to make. Although Ceph is not a perfect storage system, I have been very happy with the resiliency and performance it provides, and at a wonderful price point. Even spending lots of $$ on Fibre Channel SANs does not guarantee great reliability or performance. I've had my fair share of outages on these, where failovers did not work properly or where some bug left their engineers scratching their heads for years, never to be fixed.

- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, May 13, 2015 at 10:29 AM, Vasiliy Angapov wrote:
Thanks, Sage!

In the meantime I asked the same question in the #Ceph IRC channel and Be_El gave me exactly the same answer, which helped.
I also realized that http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ states: "You may change this grace period by adding an osd heartbeat grace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." But in reality you must add this option to the [global] section. Setting this value in the [osd] section influenced only the OSD daemons, not the monitors.
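
The section behaviour described here can be sketched with a small lookup model: a daemon sees options from [global] plus the section for its own type. This is a simplified illustration, not Ceph's real config loader. The function names are hypothetical, and per-daemon sections such as [osd.0], command-line options and runtime injection are ignored.

```python
import configparser


def _normalize(name):
    # Ceph treats spaces, underscores and dashes in option names alike.
    return name.strip().lower().replace("_", " ").replace("-", " ")


def effective_value(conf_text, daemon_type, option):
    """Return the value a daemon of `daemon_type` ("mon", "osd", ...) would
    read for `option`, or None if it would fall back to the built-in default.

    Simplified model: [global] first, then the daemon-type section overrides.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    wanted = _normalize(option)
    value = None
    for section in ("global", daemon_type):
        if not parser.has_section(section):
            continue
        for key, val in parser.items(section):
            if _normalize(key) == wanted:
                value = val
    return value


OSD_ONLY = """
[osd]
osd heartbeat grace = 5
"""

IN_GLOBAL = """
[global]
osd heartbeat grace = 5
"""

print(effective_value(OSD_ONLY, "osd", "osd_heartbeat_grace"))   # OSDs see it
print(effective_value(OSD_ONLY, "mon", "osd_heartbeat_grace"))   # monitors do not
print(effective_value(IN_GLOBAL, "mon", "osd_heartbeat_grace"))  # now they do
```

In this model the monitors keep their default grace period when the option lives only under [osd], which matches the behaviour observed above.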

Anyway, IO now resumes after only a 5-second freeze. Thanks for the help, guys!

Regarding Ceph failure detection: in a real environment, it seems to me that a 20-30 second freeze after a single storage node outage is very expensive.
Even when we talk about data consistency... 5 seconds is an acceptable threshold.

But, Sage, can you please briefly explain what the drawbacks of lowering the timeout are? If, for example, I have a stable 10 gig cluster network which is not likely to lag or drop out, is 5 seconds dangerous in any way? How can OSDs report false positives in that case?

Thanks in advance :)

On Wed, May 13, 2015 at 7:05 PM, Sage Weil wrote:
On Wed, 13 May 2015, Vasiliy Angapov wrote:
> Hi,
>
> Well, I've managed to find out that a clean stop of an OSD causes no IO
> downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> tolerance, which Ceph is supposed to provide. However, "killall -9 ceph-osd"
> causes IO to stop for about 20 seconds.
>
> I've tried lowering some timeouts but without luck. Here is a related part
> of my ceph.conf after lowering the timeout values:
>
> [global]
> heartbeat interval = 5
> mon osd down out interval = 90
> mon pg warn max per osd = 2000
> mon osd adjust heartbeat grace = false
>
> [client]
> rbd cache = false
>
> [mon]
> mon clock drift allowed = .200
> mon osd min down reports = 1
>
> [osd]
> osd heartbeat interval = 3
> osd heartbeat grace = 5
>
> Can you help me to reduce IO downtime somehow? Because 20 seconds for
> production is just horrible.

You'll need to restart ceph-osd daemons for that change to take effect, or

 ceph tell osd.\* injectargs '--osd-heartbeat-grace 5 --osd-heartbeat-interval 1'

Just remember that this timeout is a tradeoff against false positives--be
careful tuning it too low.

Note that ext4 going ro after 5 seconds sounds like insanity to me.  I've
only seen this with older guest kernels, and iirc the problem is a
120s timeout with ide or something?

Ceph is a CP system that trades availability for consistency--it will
block IO as needed to ensure that it is handling reads or writes in a
completely consistent manner.  Even if you get the failure detection
latency down, other recovery scenarios are likely to cross the magic 5s
threshold at some point and cause the same problem.  You need to fix your
guests one way or another!

sage

>
> Regards, Vasily.
>
>
> On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov wrote:
> Thanks, Gregory!
> My Ceph version is 0.94.1. What I'm trying to test is the worst-case
> situation, when the node loses its network or becomes unresponsive. So
> what I do is "killall -9 ceph-osd", then reboot.
>
> Well, I also tried to do a clean reboot several times (just a "reboot"
> command), but I saw no difference - there is always an IO freeze for
> about 30 seconds. Btw, I'm using Fedora 20 on all nodes.
>
> Ok, I will play with timeouts more.
>
> Thanks again!
>
> On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum wrote:
>       On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov wrote:
>       > Hi, colleagues!
>       >
>       > I'm testing a simple Ceph cluster in order to use it in a production
>       > environment. I have 8 OSDs (1 TB SATA drives) which are evenly
>       > distributed between 4 nodes.
>       >
>       > I've mapped an rbd image on the client node and started writing a lot
>       > of data to it. Then I just reboot one node and see what happens. What
>       > happens is very sad. I have a write freeze for about 20-30 seconds,
>       > which is enough for the ext4 filesystem to switch to RO.
>       >
>       > I wonder if there is any way to minimize this lag? AFAIK, ext
>       > filesystems have a 5-second timeout before switching to RO. So is
>       > there any way to get that lag below 5 secs? I've tried lowering
>       > different osd timeouts, but it doesn't seem to help.
>       >
>       > How do you deal with such situations? 20 seconds of downtime is not
>       > tolerable in production.
>
> What version of Ceph are you running, and how are you rebooting it?
> Any newish version that gets a clean reboot will notify the cluster
> that it's shutting down, so you shouldn't witness blocked writes
> really at all.
>
> If you're doing a reboot that involves just ending the daemon, you
> will have to wait through the timeout period before the OSD gets
> marked down, which defaults to 30 seconds. This is adjustable (look
> for docs on the "osd heartbeat grace" config option), although if you
> set it too low you'll need to change a bunch of other timeouts which I
> don't know off-hand...
> -Greg
>
>
>
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVU6QiCRDmVDuy+mK58QAAYUEP/AlFtupR0HhnR/+1bZO0
1kbhLlwLpIZw0OYDe6OQiBBc8UKF49H8jG4ddkmNqsdNxASc6QuRMNRnSipP
+mxdOGUDXALvM0X83Dvf8oTbkTw11ofCFHXD6xV5Nua4OHevswC0Lo13rO4f
nzk4P3onr1CrqPHhIuGvLxMkhizHXooJvsnvKdzkXMbGoYY05tqC1++WB9NN
gmy5ABBaCrAIXnveBD35ZRHPTnRrzNGkQZQ3+ICQYWVCFqj8AzCcLwuyKgnc
bkwkCc34XOfUrrmByFM58O6+etu6AjyOEcCZh9W5Niwoy99qQPkiXa8RyJ3A
pd5M9v9AqXCSHE+DDyVATZkmLBiqS/SeQiV8Rrz7siAb07Wuc5jDgQwntYnu
a9ziNXlH2v3hK9gUEdGDKw1HDf53NnpPTj9cHbA4zwUt4v5LJX4FOOYYV19W
te3TrDigouxI9Z4m9QHItxNUBAHIYt2IAug0BdhYXgBMRZjebsDCGfYq0qnJ
K5aXSK1WgcseoOC4B6BZofNFsWen46HPk+bekiBfa87nBjBZB8XLnl7Q9uxe
m+faknIRVjZKRAvB+CeegT7lVupEG56IkCOjZjnwXIDfyzidLbKIn3oAJ8Ld
oG0oF2pspQ0gfvDhb2tvQnBtBZj43f3lLAkR1URv1hbqyEl97tDnf2OOQJp2
MvnZ
=lnv1
-----END PGP SIGNATURE-----
