Re: Write freeze when writing to rbd image and rebooting one of the nodes

Ceph is an object storage system that can also provide block devices and a file system. My impression (and it is mine alone) is that it excels at data integrity and at providing enterprise-class storage in a free and open source package. It is a highly configurable, distributed storage system that can tolerate many components failing. Probably the most significant thing for me is the number of features and the active community, and if I think it is lacking something or has done something wrong, I can alter the code myself (not that I'm that smart). One thing I like about Ceph is that its objects are mutable, which most other object storage systems don't offer and which makes RBD possible at a reasonable cost. Ceph, like other distributed systems, lets you scale not only storage and IO performance but also bandwidth and reliability as the cluster grows.

It is really expensive and difficult to get a SAN that can survive a row failure in a data center; it is really easy to do with Ceph. I would also warn about viewing "Enterprise" in a certain way, because people define Enterprise very differently. To some it means there is a gold support contract (someone else can be blamed for problems); to others it means the product is reliable (usually an older, tried-and-true technology); and to others it is the gold standard and the best possible technology (this has never been my experience, and I'm more inclined to believe it is a fantasy that I fell for in earlier years).

I hope that you find Ceph fits your use case, but if failover in less than 5 seconds is your #1 priority and money is no object, then Ceph may not be the best fit. (Honestly, I haven't seen a storage system that reliably provides <5 second failover in every situation, and I've worked on some big name brand expensive systems in my day. I've also seen redundant local storage partially fail and cause >5 second outages. In my experience this requirement is an unrealistic expectation.)

Infiniband and more expensive servers really won't "fix" the challenges of commodity distributed systems. Even with heartbeat/pacemaker/corosync clusters there is usually about 30 seconds between when a node stops getting its partner's heartbeats (usually on a crossover cable) and when it starts serving the partner's services. It has to make sure the other node didn't just have a hiccup, and then it has to shut the bad node down (IPMI, etc.) and fence it (setting the network switch ports down in case it comes up again). In a split-brain situation, both nodes can send the IPMI kill command at the same time because they got trigger happy and didn't wait long enough. So instead of using pistols and being very selective about what is being killed, grenades are being thrown in every direction and shrapnel is taking everything out.
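
For reference, in corosync-based clusters that detection window is governed largely by the totem timeouts; here is a minimal corosync.conf sketch of the relevant knobs (the values are illustrative assumptions, not recommendations):

    totem {
        version: 2
        # milliseconds without a token before a node is declared lost
        token: 10000
        # milliseconds to wait for consensus before forming a new membership
        consensus: 12000
    }

Shrinking these numbers shortens detection time, but it also raises the odds of fencing a node that merely hiccupped.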

This is a very common problem with clustered and distributed services, and a trade-off has to be made at some point. The more I think about reducing these timeouts, the more nervous I get. For instance, if there is a topology change in an Ethernet network, it can take 30-50 seconds for Spanning Tree to converge the network, and 2-3 seconds or more for Rapid Spanning Tree. A 5-second failover timeout does not leave a lot of room for any kind of error other than a node simply failing. I know you say that your network is rock solid, but most of my problems have come from networks that made such claims, even Infiniband. It is surprising how fragile networks are.

We have implemented caching, which helps make the failover time less painful, but it is not a complete solution. We have tortured our Ceph clusters, and the fact that the cluster heals itself automatically and we haven't lost any data is much more comforting than a short interruption when an unexpected failure happens (setting the fs timeout to 300 seconds makes it a pause rather than an error/failure). For us the benefits of everything else Ceph offers outweighed this one challenge, and we have done all we can to reduce the effects of the failover time to something we feel comfortable with.

I just want to reiterate that these are my views only and may not reflect the views of Sage and others involved with the Ceph project.

----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, May 13, 2015 at 2:32 PM, Vasiliy Angapov <angapov@xxxxxxxxx> wrote:
Robert, thank you very much for sharing your wisdom with me! Much appreciated.

I think I more or less got your point. Ceph is not a SAN; this sounds logical.
What I'm trying to understand is what Ceph is for and what it is not for... Is there any article about that? :)

I've heard that Ceph is enterprise storage, but spending 20 seconds waiting for IO doesn't sound very enterprise.
Does that mean that I should really switch back to local storage systems (or expensive SDS like Nutanix) and forget about Ceph?
Look, I'm not criticizing, I just want to get things clear.

I know that the network is crucial for Ceph. I can accept that; I can build a rather expensive but reliable cluster network, even Infiniband if required.
I know that Ceph is not good with small clusters of 10 OSDs or fewer. I can buy more hosts and more disks; this is not a problem.
I know that commodity hardware is not very reliable. I can buy nice servers, nice controllers, nice disks and NVMe SSDs.
Considering that layout, will all of this be enough for Ceph not to fall flat with a 5-second osd grace interval under high load?

Or should I go somewhere else with my enterprise wishes?

For now I have a rather small test cluster with several SATA drives and SSDs, so my real concern is: should I say to my boss, "OK, seems like this is working fine, let's go ahead then"?
What do you think of it, colleagues? I would be very grateful for any shared knowledge.



On Wed, May 13, 2015 at 11:21 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

Don't let me deter you from pointing a gun at your foot. I think you may be confusing Ceph with a SAN, which it certainly is not. In an expensive SAN, 5 seconds is very much unacceptable for a failover event between controllers. However, there are many safeguards in place, using custom communication channels and protocols, to ensure that there is no corruption and that misbehaving controllers are fenced.

Ceph is a distributed storage system built on commodity hardware, and it is striving to fill a different role. In a distributed system you have to get a consensus on the state of each member to know what is really going on. Since there are more external forces at play in a distributed system, it usually can't be as quick to make decisions. Because dedicated communication channels don't exist, and things like flaky network cables, cards, loops, etc. can all impact the cluster's ability to have a good idea of the state of each node, you usually have to wait some time. This is partly due to the protocols used, like TCP, which have certain timeout periods.

Recovery in Ceph is also more expensive than in a SAN. SANs rely on controllers being able to access the same disks on the backend; Ceph replicates the data to provide availability. This can stress even very high bandwidth networks, disks, CPUs, etc. You don't want to jump into recovery too quickly, as it can cause a cascading effect.

This gets to answering your question directly. If you use very short timings for detecting failures, you can get into a feedback loop that takes your whole cluster down and keeps it from getting healthy. If a node fails and recovery starts, other nodes may be too busy to respond to heartbeats fast enough, or the heartbeats are lost in transit; they then start getting marked down incorrectly, another round of recovery starts stressing the remaining OSDs, and you get a downward spiral. At the same time, OSDs that were wrongly marked down start responding and are brought back into the cluster, the recovery restarts, and you end up with constant OSD flapping.
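
As a side note (my own sketch, not something specific to your setup): for planned maintenance you can sidestep that recovery storm entirely by telling the monitors not to mark OSDs out while a node is down:

    # before rebooting a storage node for planned maintenance
    ceph osd set noout
    # ... reboot, bring the OSDs back up ...
    ceph osd unset noout

The related "mon osd down out interval" setting controls how long an OSD may stay down before it is marked out and re-replication begins.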

While 30 seconds seems like a long time, it is certainly better than being offline for a month due to corrupted data (true story, I have the scars to prove it). I make it a habit on any virtualized machine to increase the timeout of the file system to 300 seconds, to help weather the storms that may happen anywhere in the pipe between the VM OS -> hypervisor -> storage system. 30 seconds is well within the time that a typical impatient end user will just punch the refresh button; the page will load pretty quickly after that and they won't think twice.
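
For what it's worth, on a Linux guest with SCSI-attached disks that habit usually boils down to raising the block device timeout, something like the following (the device name and rule file name are just illustrative):

    # one-off, on the running guest
    echo 300 > /sys/block/sda/device/timeout

    # or persistently, e.g. in /etc/udev/rules.d/99-disk-timeout.rules
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="300"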

However, if you really need 5 seconds or less for failover on unexpected failures, I would suggest that you reevaluate the things that are important to you and the trade-offs you are willing to make. Although Ceph is not a perfect storage system, I have been very happy with the resiliency and performance it provides, and at a wonderful price point. Even spending lots of $$ on Fibre Channel SANs does not guarantee great reliability or performance. I've had my fair share of outages on those, where failovers did not work properly or where some bug left their engineers scratching their heads for years, never to be fixed.

----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, May 13, 2015 at 10:29 AM, Vasiliy Angapov  wrote:
Thanks, Sage!

In the meantime I asked the same question on the #ceph IRC channel and Be_El gave me exactly the same answer, which helped.
I also realized that http://ceph.com/docs/master/rados/configuration/mon-osd-interaction/ states: "You may change this grace period by adding an osd heartbeat grace setting under the [osd] section of your Ceph configuration file, or by setting the value at runtime." But in reality you must add this option to the [global] section: setting the value in the [osd] section only influenced the osd daemons, not the monitors.
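
So the working configuration ends up looking something like this (a sketch using the value from this thread, not a general recommendation):

    [global]
    osd heartbeat grace = 5

Restart the daemons, or use the injectargs command Sage gives below, to apply it to a running cluster.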

Anyway, now IO resumes after only a 5-second freeze. Thanks for the help, guys!

Regarding Ceph failure detection: in a real environment, 20-30 seconds of freeze after a single storage node outage seems very expensive to me, even when we are talking about data consistency... 5 seconds is an acceptable threshold.

But, Sage, can you please explain briefly what the drawbacks of lowering the timeout are? If, for example, I have a stable 10-gig cluster network which is not likely to lag or drop out - is 5 seconds dangerous in any way? How can OSDs report false positives in that case?

Thanks in advance :)

On Wed, May 13, 2015 at 7:05 PM, Sage Weil  wrote:
On Wed, 13 May 2015, Vasiliy Angapov wrote:
> Hi,
>
> Well, I've managed to find out that a correct stop of the osd causes no IO
> downtime (/etc/init.d/ceph stop osd). But that cannot be called fault
> tolerance, which Ceph is supposed to provide. However, "killall -9 ceph-osd"
> causes IO to stop for about 20 seconds.
>
> I've tried lowering some timeouts but without luck. Here is a related part
> of my ceph.conf after lowering the timeout values:
>
> [global]
> heartbeat interval = 5
> mon osd down out interval = 90
> mon pg warn max per osd = 2000
> mon osd adjust heartbeat grace = false
>
> [client]
> rbd cache = false
>
> [mon]
> mon clock drift allowed = .200
> mon osd min down reports = 1
>
> [osd]
> osd heartbeat interval = 3
> osd heartbeat grace = 5
>
> Can you help me to reduce IO downtime somehow? Because 20 seconds for
> production is just horrible.

You'll need to restart ceph-osd daemons for that change to take effect, or

 ceph tell osd.\* injectargs '--osd-heartbeat-grace 5 --osd-heartbeat-interval 1'

Just remember that this timeout is a tradeoff against false positives--be
careful tuning it too low.

Note that ext4 going ro after 5 seconds sounds like insanity to me.  I've
only seen this with older guest kernels, and iirc the problem is a
120s timeout with ide or something?

Ceph is a CP system that trades availability for consistency--it will
block IO as needed to ensure that it is handling reads or writes in a
completely consistent manner.  Even if you get the failure detection
latency down, other recovery scenarios are likely to cross the magic 5s
threshold at some point and cause the same problem.  You need to fix your
guests one way or another!

sage


>
> Regards, Vasily.
>
>
> On Wed, May 13, 2015 at 9:57 AM, Vasiliy Angapov  wrote:
>       Thanks, Gregory!
> My Ceph version is 0.94.1. What I'm trying to test is the worst
> situation, when the node loses network or becomes unresponsive. So
> what I do is "killall -9 ceph-osd", then reboot.
>
> Well, I also tried to do a clean reboot several times (just a "reboot"
> command), but I saw no difference - there is always an IO freeze of
> about 30 seconds. Btw, I'm using Fedora 20 on all nodes.
>
> Ok, I will play with timeouts more.
>
> Thanks again!
>
> On Wed, May 13, 2015 at 10:46 AM, Gregory Farnum 
> wrote:
>       On Tue, May 12, 2015 at 11:39 PM, Vasiliy Angapov
>        wrote:
>       > Hi, colleagues!
>       >
>       > I'm testing a simple Ceph cluster in order to use it in a production
>       > environment. I have 8 OSDs (1TB SATA drives) which are evenly
>       > distributed between 4 nodes.
>       >
>       > I've mapped an rbd image on the client node and started writing a lot
>       > of data to it. Then I just reboot one node and see what happens. What
>       > happens is very sad: I have a write freeze for about 20-30 seconds,
>       > which is enough for the ext4 filesystem to switch to RO.
>       >
>       > I wonder, is there any way to minimize this lag? AFAIK, ext filesystems
>       > have a 5-second timeout before switching to RO. So is there any way to
>       > get that lag below 5 secs? I've tried lowering different osd timeouts,
>       > but it doesn't seem to help.
>       >
>       > How do you deal with such situations? 20 seconds of downtime is not
>       > tolerable in production.
>
> What version of Ceph are you running, and how are you rebooting it?
> Any newish version that gets a clean reboot will notify the cluster
> that it's shutting down, so you shouldn't witness blocked writes
> really at all.
>
> If you're doing a reboot that involves just ending the daemon, you
> will have to wait through the timeout period before the OSD gets
> marked down, which defaults to 30 seconds. This is adjustable (look
> for docs on the "osd heartbeat grace" config option), although if you
> set it too low you'll need to change a bunch of other timeouts which
> I don't know off-hand...
> -Greg
>
>
>
>
>




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
