Re: adding cache tier in a production Hammer environment

Hello,

On Sat, 9 Apr 2016 02:14:45 +0200 Oliver Dzombic wrote:

> Hi Christian,
> 
> Yeah, I saw the problems with the cache tier in the current Hammer.
> 
> But as far as I can tell, I would not run into those scenarios. I
> don't plan to change settings in a way that would break things.
>
Shouldn't, but I'd avoid it anyway.
 
> But I already decided to wait for Jewel, build a whole new cluster,
> and copy all the data over.
> 
Sounds like a safer alternative.
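
If the images are RBD (which they will be for KVM), the copy itself can
be done with export/import over a pipe; a rough sketch, with pool/image
names and conf paths as placeholders:

  rbd -c /etc/ceph/old-cluster.conf export rbd/vm-disk-1 - \
    | rbd -c /etc/ceph/new-cluster.conf import - rbd/vm-disk-1

Slow but simple, and you can do it image by image while the respective
VM is stopped.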

> -
> 
> I am running KVM instances, and will also run OpenVZ instances. Maybe
> LXC too, let's see. They run all kinds of different, independent
> applications.
> 
> -
> 
> Well, I have to admit the beginnings of the cluster were quite
> experimental. It started with 4 nodes (2x 2.3 GHz Intel Celeron CPUs
> for 3x 6 TB HDD + 80 GB SSD, with 16 GB RAM each), later extended by 2
> more of the same kind. Currently there is also an E3-1225v5 with 32 GB
> RAM, 10x 3 TB HDD and 2x 120 GB SSD.
> 
Would you mind sharing what exact models of HDDs and SSDs you're using?
Also, is the newer node showing the same ratio of unresponsive OSDs as the
older ones?

In the atop output you posted, which ones are the SSDs (if they're in
there at all)?

> But all my Munin graphs tell me it's HDD related; I can show them to
> you if you want. I guess the heavy random access on the drives is just
> killing them.
> 
Yup, I've seen that with the "bad" cluster here: the first sign that
things were getting to the edge of their IOPS capacity was that
deep-scrubs killed performance, and then even regular scrubs did.

> I also deactivated (deep) scrub because of this problem and just let
> it run at night, like now, and I'm seeing 90% utilization on the
> journals and 97% utilization on the HDDs.
> 
This confuses me as well: during deep-scrubs all data gets read, so your
journals shouldn't get any busier than they were before, and last time you
mentioned them being at around 60% or so?

> And yes, it's simply fixed by restarting the OSDs.
> 
> They hit a heartbeat timeout and just go down/out.
> 
Which timeout is it, the peer one or the monitor one?
Have you tried upping the various parameters to prevent this?
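
If not, something like this in ceph.conf might be a starting point; the
values are purely illustrative and the right numbers depend on how slow
your disks actually get under load:

  [global]
  # give busy OSDs more time to answer heartbeats before peers
  # report them as down (default: 20 seconds)
  osd heartbeat grace = 45
  # wait longer before a down OSD is marked out and rebalancing
  # starts (default: 300 seconds)
  mon osd down out interval = 900

Keep in mind that raising these only hides the symptom, the OSDs are
still unresponsive for that long.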

> I tried to set the flag so that OSDs won't be marked down/out.
> That worked: they did not get marked down/out, but it happened anyway
> and the cluster became unstable (misplaced objects / recovery).
> 
That's a band-aid indeed, but I wouldn't expect misplaced objects from it.
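
Just so we're talking about the same thing, I assume you mean these:

  ceph osd set noout      # down OSDs are not marked out, no rebalancing
  ceph osd set nodown     # failure reports are ignored, OSDs stay "up"
  ceph osd unset noout    # back to normal behaviour
  ceph osd unset nodown

nodown in particular just masks the problem, clients will still block on
the unresponsive OSD.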

> The way I see it: if a VM has a file open and is using it "right now",
> and that file sits on a PG on the OSD that is going down/out "right
> now", then the VM's filesystem will get into trouble.
> 
> It will see a bus error. Depending on how much data is on that OSD, and
> on how many VMs are accessing their data at that moment, a lot of VMs
> will receive a bus error.
> 
> But it gets even worse. On Linux operating systems such a situation
> will, in most cases, cause an automatic read-only remount of the root
> partition. And that way, of course, the server basically stops doing
> its job.
> 
> And, as if that were not bad enough, as long as you don't have your OSD
> back up and in, the VMs will not reboot.
> 
> Maybe because everything is simply too slow, or maybe because the
> primary OSD went down and has not been rebalanced yet, the VM will see
> I/O errors and not be able to access its disk.
> 
> I already wrote about this here:
> 
> http://article.gmane.org/gmane.comp.file-systems.ceph.user/27899/match=data+inaccessable+after+single+osd+down+default+size+3+min+1
> 
> but didn't get any reaction to it.
> 
> As I see it, Ceph's resilience is primarily focused on not losing data
> in cases of (hardware) failure, and on letting you use at least some
> part of your infrastructure until the other part has recovered.
> 
> Based on my experience, Ceph is not very resilient when it comes to the
> data that is being accessed at the very moment the hardware failure
> occurs.
> 
People (including me) have seen things like that before, and there are
definitely places in Ceph where behavior like this can and needs to be
improved.
However, this is also very dependent on your timeouts (in case of
unexpected OSD failures) and the load on your cluster (how long it takes
for the PGs to re-peer, etc.).
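
If you want to put numbers on that, watching the cluster while an OSD
goes down gives a rough picture (standard commands, nothing exotic):

  ceph -s                        # overall health, degraded/misplaced counts
  ceph pg dump_stuck inactive    # PGs that cannot serve I/O (still peering)
  ceph pg dump_stuck unclean     # PGs that have not fully recovered yet

The time the inactive list takes to empty out is roughly the window in
which your VMs will see hanging or failing I/O.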

> But, to be fair, two points are very important:
> 
> 1. My Ceph config is for sure very basic and certainly has some room
> for improvement.
> 
> 2. Windows filesystems are able to handle that situation much better.
> 
> Linux filesystems will in many cases have
> 
> errors=remount-ro in their /etc/fstab by default
> 
> (mostly seen on Debian-based distributions, but also others).
> This causes the automatic read-only remount on those bus errors.
> 
> Windows does not have that. As far as I can see, CentOS/Red Hat don't
> have it either, so those will just continue to work.
> 
> So in the very end, that's not (only) Ceph's fault.
> 
> But it seems that reads/writes which were in flight at the moment of
> the hardware failure are not in all cases buffered and retried.
> 
> ----
> 
> So far my experience, and some theoretical thoughts about what I see.
> 
> On the other hand, if I simply turn off a node or pull the network
> cable, there are no bus errors. So Ceph handles those scenarios better.
> 
> But especially when OSDs go down/out because of a heartbeat timeout, it
> seems Ceph is not that resilient.
> 
As I said, load plays a part here; Ceph clearly being unable to
communicate with an OSD and that OSD merely being very slow to respond are
two different things.
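
As for the errors=remount-ro behaviour you mention above: that's a
per-filesystem setting inside the VM, and it can be inspected and changed,
for example (the device name is just an example):

  # inside the VM: what does the filesystem do when it hits an error?
  tune2fs -l /dev/vda1 | grep -i 'errors behavior'

  # carry on instead of remounting read-only
  # (equivalent to errors=continue in /etc/fstab)
  tune2fs -e continue /dev/vda1

Whether you actually want that is another question; remount-ro exists to
protect the filesystem, so changing it only hides the underlying I/O
stall.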

> ----
> 
> From my experience, the more I/O wait the server has, the higher the
> risk of OSDs going down. That is the only discernible pattern I could
> see.
> 
> Right now, scrubbing is running. I have no special settings for this,
> all defaults.
> 
Change that, especially the sleep time.
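
For example, something along these lines (a sketch, the values are
illustrative and worth tuning against your own load):

  [osd]
  # sleep between scrub chunks so client I/O gets a chance (default: 0)
  osd scrub sleep = 0.1
  # skip new scrubs while the node is already loaded (default: 0.5)
  osd scrub load threshold = 2.0
  # stretch deep-scrubs from one week to two (default: 604800)
  osd deep scrub interval = 1209600

And/or schedule deep-scrubs explicitly during quiet hours instead of
letting them pile up.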

> Atop looks like that:
> 
> http://pastebin.com/mubaZbk2
> 
That looks a lot more reasonable than the 100ms+ times you quoted from
Munin, pretty typical for very busy HDDs.
Again, which ones are the SSDs?

> 
> ----
> 
> The distance between the two datacenters is a few km (within the same
> city).
> 
> Latency is:
> 
> rtt min/avg/max/mdev = 0.528/0.876/1.587/0.333 ms
> 
> So it should not be a big issue. But that will be changed.
> 
> ----
> 
> The plan is that the cache pool will be ~1 TB against 30-60 TB of raw
> HDD capacity, on each of 2-3 OSD nodes.
> 
> As I see it, that should be enough to get rid of this random access
> pattern and turn it into a more linear stream going to and from the
> cold HDDs.
> 
That's still a tad random once the cache gets full and starts flushing,
and it's at least 4 MB (one object) per write.
But yes, at least for me the backing storage has no issues now with both
promotions and flushing happening.
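
The knobs that decide when flushing and eviction start are per-pool
settings, roughly like this for a ~1 TB cache pool (pool name and values
are placeholders):

  ceph osd pool set cachepool target_max_bytes 1099511627776
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4
  ceph osd pool set cachepool cache_target_full_ratio 0.8
  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool hit_set_count 4
  ceph osd pool set cachepool hit_set_period 1200

Without target_max_bytes (or target_max_objects) the cache tier never
considers itself "full" and will happily run the SSDs out of space.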
 
> In any case, with a cache the situation can only improve. If I see
> that the hot cache fills up instantly, I will have to change the
> strategy again. But as I see it, each VM might have maybe up to 1-5%
> "hot" data on average.
> 
It will definitely improve things, but to optimize your cluster's
performance you're probably best off with very aggressive read-recency
settings or the readforward cache mode.
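
On Hammer that would look roughly like this (again assuming the tier is
called "cachepool"):

  # only promote objects that show up in recent hit sets,
  # instead of on the first touch
  ceph osd pool set cachepool min_read_recency_for_promote 2

  # or serve reads from the backing pool and only cache writes
  ceph osd tier cache-mode cachepool readforward

Which of the two makes sense depends on whether your hot data is
read-heavy or write-heavy.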

Christian
> So I think / hope things can only get better with some faster drives
> in between.
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


