Hi,

currently in use:

oldest:
SSDs: Intel S3510 80 GB
HDDs: HGST 6 TB H3IKNAS600012872SE NAS

latest:
SSDs: Kingston 120 GB SV300
HDDs: HGST 3 TB H3IKNAS30003272SE NAS

in future will be in use:
SSDs: Samsung SM863 240 GB
HDDs: HGST 3 TB H3IKNAS30003272SE NAS and/or Seagate ST2000NM0023 2 TB

-----

It is hard to say if, and how often, the newer nodes fail with OSDs going down/out compared to the old ones. We did a lot to avoid that. Without having real numbers, my feeling is/was that the newer ones fail much less often. But what exactly is responsible for that is unknown.

In the very end, the old nodes, with a 2x 2.3 GHz Intel Celeron (2 cores, no HT) and 3x 6 TB HDDs, have much less CPU power per HDD than the 4x 3.3 GHz Intel E3-1225v5 (4 cores) with 10x 3 TB HDDs. It is just too different: CPU, HDD, RAM, even the HDD controller.

I will have to make sure that the new cluster has enough hardware headroom, so that I don't need to consider possible problems there.

------

atop: sda/sdb == SSD journal

------

That was my first experience too. At the very beginning, deep scrubs and even normal scrubs were driving the %WA and the busy time of the HDDs to a flat 100%.

------

I rechecked it with munin. The journal SSDs go from ~40% up to 80-90% during deep scrub. The HDDs go from ~20% up to a more or less flat 90-100% during deep scrub.

At the same time, the load average goes to 16-20 (4 cores) while the CPU sees up to 318% I/O wait (out of a max of 400%).

------

The OSDs receive a peer timeout. Which is understandable if the system sees ~300% I/O wait for just long enough.

------

And yes, as it seems, clusters which are very busy, especially with limited hardware resources, need much more than the standard config can/will deliver. As soon as the LTS is out I will have to start busting my head with the available config parameters.
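For reference, the parameters I expect to look at first are the scrub throttling and heartbeat grace options. A rough sketch of what such an [osd] section could look like; the values below are untested assumptions / starting points, not recommendations:

  [osd]
  # pause between scrub chunks so client I/O can get through
  osd scrub sleep = 0.1
  # don't start new scrubs while the node is already loaded
  osd scrub load threshold = 2.0
  # one scrub per OSD at a time (the default, listed for clarity)
  osd max scrubs = 1
  # restrict scrubbing to the night hours (option exists in newer releases)
  osd scrub begin hour = 1
  osd scrub end hour = 6
  # give busy OSDs more time before peers report them down
  # (may need to go into [global] so the mons use the same grace)
  osd heartbeat grace = 60

Presumably the same values can first be tried at runtime with something like "ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'" before writing them into ceph.conf.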
-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.04.2016 um 05:06 schrieb Christian Balzer:
>
> Hello,
>
> On Sat, 9 Apr 2016 02:14:45 +0200 Oliver Dzombic wrote:
>
>> Hi Christian,
>>
>> Yeah, I saw the problems with the cache tier in the current Hammer.
>>
>> But as far as I can see, I would not run into those scenarios. I don't plan to change settings like that and let it go to rubbish.
>>
> Shouldn't, but I'd avoid it anyway.
>
>> But I have already decided to wait for Jewel, create a whole new cluster and copy all the data.
>>
> Sounds like a safer alternative.
>
>> -
>>
>> I am running KVM instances, and will also run OpenVZ instances, maybe LXC too, let's see. They run all kinds of different, independent applications.
>>
>> -
>>
>> Well, I have to admit the beginnings of the cluster were quite experimental. I was using 4x (2x 2.3 GHz Intel Celeron CPUs for 3x 6 TB HDD + 80 GB SSD, with 16 GB RAM) and extended it by 2 additional nodes of that kind, and currently also an E3-1225v5 with 32 GB RAM, 10x 3 TB HDD and 2x 120 GB SSD.
>>
> Would you mind sharing what exact models of HDDs and SSDs you're using?
> Also, is the newer node showing the same ratio of unresponsive OSDs as the older ones?
>
> In the atop output you posted, which ones are the SSDs (if they're in there at all)?
>
>> But all my munin tells me it is HDD related; if you want I can show it to you. I guess the hardcore random access on the drives is just killing them.
>>
> Yup, I've seen that with the "bad" cluster here; the first thing to indicate things were getting to the edge of IOPS capacity was that deep-scrubs killed performance, and then even regular scrubs.
>
>> I also deactivated (deep) scrub because of this problem and just let it run in the night, like now, and I am seeing 90% utilization on the journals and 97% utilization on the HDDs.
>>
> This confuses me as well: during deep-scrubs all data gets read, so your journals shouldn't get busier than they were before, and last time you mentioned them being around 60% or so?
>
>> And yes, it is simply fixed by restarting the OSDs.
>>
>> They receive a heartbeat timeout and just go out/down.
>>
> Which timeout is it, the peer one or the monitor one?
> Have you tried upping the various parameters to prevent this?
>
>> I tried to set the flag so that there will be no out/down.
>> That worked. The OSD did not get marked out/down, but it happened anyway and the cluster became unstable (misplaced objects / recovery).
>>
> That's a band-aid indeed, but I wouldn't expect misplaced objects from it.
>
>> The way I see the situation: if a VM has a file open and is using it "right now", and that file is located in a PG on the OSD that is going down/out "right now", then the filesystem of the VM will get into trouble.
>>
>> It will see a bus error. Depending on the amount of data on that OSD, and depending on how many VMs are accessing their data "right now", you will have a lot of VMs receiving a bus error.
>>
>> But it gets even worse. It can, and on Linux operating systems in most cases will, happen that such a situation causes an automatic read-only remount of the root partition. And this way, of course, the server will basically stop doing its job.
>>
>> And, as if that were not bad enough, as long as you don't have your OSD back up and in, the VMs will not reboot.
>>
>> Maybe because everything is simply too slow, or maybe you had the luck that the primary OSD went down; until that is rebalanced, the VM will have I/O errors and will not be able to access its HDD.
>>
>> I already wrote about this here:
>>
>> http://article.gmane.org/gmane.comp.file-systems.ceph.user/27899/match=data+inaccessable+after+single+osd+down+default+size+3+min+1
>>
>> but didn't get any reaction to it.
>>
>> As I see it, Ceph's resilience is primarily focused on not losing data in cases of (hardware) failure, and on being able to use at least some part of your infrastructure until the other part has recovered.
>>
>> Based on my experience, Ceph is not very resilient when it comes to the data that is being accessed right at the time the hardware failure occurs.
>>
> People (including me) have seen things like that before and there are definitely places in Ceph where behavior like this can/needs to be improved.
> However this is also very dependent on your timeouts (in case of unexpected OSD failures) and the loading of your cluster (how long does it take for the PGs to re-peer, etc).
>
>> But, to be fair, two points are very important:
>>
>> 1. my Ceph config is for sure very basic and certainly has some room for improvement
>>
>> 2. Windows filesystems are able to handle that situation much better.
>>
>> Linux filesystems will in many cases have
>>
>> errors=remount-ro in their /etc/fstab by default
>>
>> (mostly seen on Debian-based distributions, but also others), which causes the automatic read-only remount on those bus errors.
>>
>> Windows does not have that. As far as I can see, CentOS/RedHat doesn't have it either, so those will just continue to work.
>>
>> So in the very end, that's not Ceph's fault (only).
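(For reference: the fstab entry in question usually looks like the sketch below. Device and filesystem are placeholders; errors=continue would avoid the automatic read-only remount, at the price of the filesystem carrying on after detected errors.)

  # /etc/fstab inside the VM (placeholder device and filesystem)
  /dev/vda1  /  ext4  defaults,errors=remount-ro  0  1
  # alternative: log the error and keep going instead of remounting read-only
  # /dev/vda1  /  ext4  defaults,errors=continue   0  1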
>>
>> But as it seems, reads/writes that were in flight at the moment of the hardware failure are not in all cases buffered and sent again.
>>
>> ----
>>
>> So far my experience, and my theoretical thinking about what I see.
>>
>> On the other side, if I regularly just turn off a node or pull the network cable, there are no bus errors. So Ceph can handle those scenarios better.
>>
>> But especially when HDDs are going out/down because of a heartbeat timeout, it seems Ceph is not that resilient.
>>
> As I said, load plays a factor; whether it clearly can't communicate with an OSD, or that OSD is just very slow to respond, are different things.
>
>> ----
>>
>> From my experience, the more I/O wait the server has, the higher the risk of OSDs going down. That is all I could see in terms of a discernible pattern.
>>
>> Right now, scrubbing is running. I have no special settings for this. All standard.
>>
> Change that, especially the sleep time.
>
>> atop looks like this:
>>
>> http://pastebin.com/mubaZbk2
>>
> That looks a lot more reasonable than the 100ms+ times you quoted from Munin, pretty typical for very busy HDDs.
> Again, which ones are the SSDs?
>
>> ----
>>
>> The distance between the two datacenters is a few km (within the same city).
>>
>> Latency is:
>>
>> rtt min/avg/max/mdev = 0.528/0.876/1.587/0.333 ms
>>
>> So that should not be a big issue. But it will be changed.
>>
>> ----
>>
>> The plan is that the cache pool will be ~1 TB against 30-60 TB of raw HDD capacity, on each of 2-3 OSD nodes.
>>
>> As I see the situation, that should be enough to end this random access stuff, turning it into a more linear stream going to and from the cold HDDs.
>>
> That's still a tad random once the cache gets full and starts flushing, and at least 4 MB (one object) per write.
> But yes, at least for me the backing storage has no issues now with both promotions and flushing happening.
>
>> In any case, with a cache the situation can only improve. If I see that the hot cache fills up instantly, I will have to change the strategy again. But as I see the situation, each VM might have maybe up to 1-5% "hot" data on average.
>>
> It will definitely improve things, but to optimize your cluster performance you're probably best off with very aggressive read-recency settings or readforward cache mode.
>
> Christian
>
>> So I think / hope things can only get better with some faster drives in between.
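PS regarding the read-recency / readforward suggestion above: if I go that route, I assume the relevant commands look roughly like the lines below ("cache-pool" is a placeholder name, the numbers are guesses, and depending on the release setting readforward may require an extra confirmation flag):

  # redirect reads to the base pool, keep writeback behaviour for writes
  ceph osd tier cache-mode cache-pool readforward
  # only promote objects seen in at least 2 of the recent HitSets
  ceph osd pool set cache-pool min_read_recency_for_promote 2
  # keep enough HitSet history for that recency check to mean something
  ceph osd pool set cache-pool hit_set_count 4
  ceph osd pool set cache-pool hit_set_period 1200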