Re: adding cache tier in production hammer environment

Hi Christian,

Yeah, I saw the problems with cache tiers in the current hammer release.

But as far as I can see, I would not run into those scenarios. I don't
plan to change settings in a way that would let things go bad.

But I have already decided to wait for Jewel, create a whole new
cluster and copy all the data over.

-

I am running KVM instances, and will also run OpenVZ instances, maybe
LXC too, let's see. They run all kinds of different, independent applications.

-

Well, I have to admit the beginnings of the cluster were quite
experimental. I was using 4x nodes ( 2x 2.3 GHz Intel Celeron CPUs for
3x 6 TB HDD + 80 GB SSD, with 16 GB RAM ), and extended that by 2
additional nodes of the same kind. Currently there is also an E3-1225v5
with 32 GB RAM, 10x 3 TB HDD and 2x 120 GB SSD.

But everything my munin graphs tell me is that it is HDD related; if
you want, I can show them to you. I guess the heavy random access on
the drives is simply killing it.
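
For reference, a quick way to confirm that on the nodes ( nothing
Ceph-specific, just standard tools ) is to watch the per-device
utilization directly, for example:

  # rough sketch: extended per-device stats, refreshed every 5 seconds
  iostat -x 5
  # or interactively with atop
  atop 5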

I also deactivated (deep) scrub because of this problem and only let
it run at night, like now, and even then I am seeing 90% utilization
on the journals and 97% utilization on the HDDs.
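
For anyone following along, by deactivating scrub I just mean the
cluster-wide flags, roughly like this ( a sketch; re-enabling would be
done from a cron job for the night window ):

  # disable scrubbing cluster-wide
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # and re-enable it at night
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub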

And yes, it's simply fixed by restarting the OSDs.

They receive a heartbeat timeout and just go out/down.

I tried to set the flags so that OSDs would not be marked out/down.
That worked insofar as they did not get marked out/down, but it
happened anyway and the cluster became unstable ( misplaced objects /
recovery ).
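
For clarity, the flags I mean are the cluster-wide noout / nodown
flags, and the "fix" is just an OSD restart, roughly like this ( osd
id 12 is only an example, and the restart command depends on the
distro / init system ):

  # keep OSDs from being marked out / down automatically
  ceph osd set noout
  ceph osd set nodown

  # restart a hung OSD
  /etc/init.d/ceph restart osd.12     # sysvinit
  systemctl restart ceph-osd@12       # systemd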

The way I see the situation: if a VM has a file open and is using it
"right now", and that file is located on a PG on the OSD that is going
down/out "right now", then the filesystem of the VM will get into
trouble.

It will see a bus error. Depending on the amount of data on that OSD,
and on how many VMs are accessing their data "right now", a lot of VMs
will receive a bus error.
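
( To see which OSDs actually serve the data of a given object, the
mapping can be checked like this; pool and object name are only
placeholders: )

  # prints the PG and the acting OSD set for that object
  ceph osd map rbd rbd_data.102f74b0dc51.0000000000000000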

But it gets even worse. On Linux operating systems this situation can,
and in most cases will, cause an automatic read-only remount of the
root partition. And that way, of course, the server basically stops
doing its job.

And, as if that were not bad enough, as long as you don't have your
OSD back up and in, the VMs will not reboot.

Maybe because everything is simply too slow, or maybe because you had
the bad luck that it was the primary OSD which went down; until that
is rebalanced, the VM will see IO errors and not be able to access its
disk.
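
( For completeness: the replication settings involved here, size and
min_size, can be checked per pool like this, "rbd" being just an
example pool name: )

  ceph osd pool get rbd size
  ceph osd pool get rbd min_size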

I already wrote about this here:

http://article.gmane.org/gmane.comp.file-systems.ceph.user/27899/match=data+inaccessable+after+single+osd+down+default+size+3+min+1

but didn't get any reaction to it.

As I see it, Ceph's resilience is primarily focused on not losing data
in case of (hardware) failure, and on being able to use at least some
part of your infrastructure until the other part has recovered.

Based on my experience, Ceph is not very resilient when it comes to
the data that is being accessed right at the moment the hardware
failure occurs.

But, to be fair, two points are very important:

1. my Ceph config is for sure very basic and certainly has some room
for improvement

2. Windows filesystems are able to handle that situation much better.

Linux filesystems will in many cases have

errors=remount-ro

in their /etc/fstab by default, mostly seen on Debian-based
distributions, but also others. This causes an automatic read-only
remount when those bus errors occur.

Windows OSes do not have that. As far as I can see, CentOS/RedHat
doesn't have it either. So those will just continue to work.
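
To illustrate, a typical Debian guest has a root entry along these
lines in /etc/fstab ( the device name is only an example ):

  # Debian default behaviour on I/O errors: remount read-only
  /dev/vda1  /  ext4  errors=remount-ro  0  1

  # after a bus error the read-only remount shows up in
  mount | grep ' / '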

So in the very end, that's not ( only ) Ceph's fault.

But as it seems, reads/writes that were in flight at the moment of the
hardware failure are not in all cases buffered and resent.

----

So far my experience, and my theoretical thoughts about what I see.

On the other hand, if I simply turn off a node or pull the network
cable, there are no bus errors. So Ceph handles those scenarios
better.

But especially when HDD OSDs are going out/down because of a heartbeat
timeout, it seems Ceph is not that resilient.
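
One knob that might be relevant there ( I am not claiming it is the
right fix, just the documented option for OSDs that are too slow to
answer heartbeats ) is the grace period in ceph.conf, for example:

  [osd]
  # default is 20 seconds; 60 is only an example value
  osd heartbeat grace = 60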

----

From my experience, the more IO wait the server has, the higher the
risk of OSDs going down. That is all I could see when it comes to a
discernible pattern.

Right now, scrubbing is running. I have no special settings for this.
All standard.
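
( If I re-enable regular scrubbing, settings like the ones you listed
would go into ceph.conf roughly like this; the values are only
examples and the exact option names should be double-checked against
the running version: )

  [osd]
  # restrict scrubbing to a night-time window and back off under load
  osd scrub begin hour = 1
  osd scrub end hour = 6
  osd scrub sleep = 0.1
  osd scrub load threshold = 2.0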

Atop looks like this:

http://pastebin.com/mubaZbk2


----

The distance between the two datacenters is a few km ( within the same city ).

Latency is:

rtt min/avg/max/mdev = 0.528/0.876/1.587/0.333 ms

So that should not be a big issue. But it will be changed anyway.

----

The plan is that the cache pool will be ~ 1 TB against 30-60 TB of raw
HDD capacity, on each of 2-3 OSD nodes.

As I see the situation, that should be enough to end this random
access problem, turning it into a more linear stream going to and from
the cold HDDs.
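
The rough sequence I have in mind for that is the standard cache
tiering setup, something like this ( pool names, PG counts and
thresholds are placeholders, and as discussed I would only do this on
0.94.7 / Jewel ):

  # cache pool on the SSDs
  ceph osd pool create cachepool 128 128

  # put it as a writeback tier in front of the cold pool
  ceph osd tier add coldpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay coldpool cachepool

  # hit set tracking plus a ~1 TB size limit and flush/evict targets
  ceph osd pool set cachepool hit_set_type bloom
  ceph osd pool set cachepool target_max_bytes 1099511627776
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4
  ceph osd pool set cachepool cache_target_full_ratio 0.8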

In any case, with a cache the situation can only improve. If I see
that the hot cache fills up instantly, I will have to change the
strategy again. But as I see it, each VM might have maybe 1-5% "hot"
data on average.

So I think / hope things can only get better with some faster drives
in between.



-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 08.04.2016 um 09:43 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 7 Apr 2016 13:18:39 +0200 Oliver Dzombic wrote:
> 
>> Hi Christian,
>>
>> thank you for answering, I appreciate your time!
>>
> 
> One thing that should be obvious but I forgot in the previous mail was
> that of course you should probably wait for 0.94.7 or Jewel before doing
> anything with cache tiers.
> 
>> ---
>>
>> Its used for RBD hosted vm's and also cephfs hosted vm's.
>>
> I've been pondering CephFS for hosting Vserver or LXC based lightweight
> VMs, what are you running on top of it?
> 
>> Well the basic problem is/was that single OSD's simply go out/down.
> Even at the worst times I didn't manage for OSDs to actually get dropped. 
> I wonder if there is something more than simple I/O overload going on
> here, as in memory/CPU exhaustion or network issues. 
> Any clues to that in the logs?
> 
> Also is this simply fixed by restarting the OSD?
> 
>> Ending in SATA BUS error's for the VM's which have to be rebooted, if
>> they anyway can, because as long as OSD's are missing in that szenario,
>> the customer cant start their vm's.
>>
> This baffles me. 
> I could see things hang for the time it takes to mark that OSD down and
> out as described here:
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
> 
> But if things still hang after that, something is very wrong.
>  
>> Installing/checking munin discovered a very high drive utilization. And
>> this way simply an overload of the cluster.
>>
> There is no discernible pattern of OSDs (or nodes) being more prone to go
> down?
> 
> Another thing to look into is the fragmentation of your OSDs.
> 
>> The initial setup was 4 nodes, with 4x mon and each 3x 6 TB HDD and 1x
>> SSD for journal.
>>
>> So i started to add more OSD's ( 2 nodes, with 3x 6 TB HDD and 1x SSD
>> for journal ). And, as first aid, reducing the replication from 3 to 2
>> to reduce the (write) load of the cluster.
>>
>> I planed to wait until the new LTS is out, but i already added now
>> another node with 10x 3 TB HDD and 2x SSD for journal and 2-3x SSD for
>> tier cache ( changing strategy and increasing the number of drives while
>> reducing the size - was an design mistake from me ).
>>
> More (independent of size) HDDs is better in terms of IOPS.
> 
>>  osdmap e31602: 28 osds: 28 up, 28 in
>>             flags noscrub,nodeep-scrub
>>       pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
>>             39270 GB used, 88290 GB / 124 TB avail
>>                 1428 active+clean
>>
>>
>>
>> The range goes from 200 op/s to around 5000 op/s.
>>
>> The current avarage drive utilization is 20-30%.
>>
> Pretty similar to what I see on my cluster that is similar to yours (24
> OSDs with SSD journals on 4 hosts, replica 3) when I load it up to 5000
> op/s.
> And that should be just fine. 
> 
>> If we have backfill ( osd out/down ) or reweight the utilization of HDD
>> drives is streight 90-100%.
>>
> Not surprisingly, and you already had some hints by Chris Taylor on how
> to reduce the impact.
> 
> That's probably also why you have scrubbing disabled, the reads (more than
> the writes) create a lot of contention (seeking) on the HDD OSDs.
> I assume you have played with the various settings to lessen the scrub
> impact, in particular:
> osd_scrub_start_hour
> osd_scrub_end_hour
> osd_scrub_sleep = 0.1
> osd scrub load threshold = 
> 
>> Munin shows on all drives ( except the SSDs ) an average disk latency of
>> 170 ms. A minimum of 80-130 ms, and a maximum of 300-600 ms.
>>
> Really? That's insanely high, I can't make the crappy HDDs in my crappy
> test cluster go over 24ms when they are at 107% utilization.
> This is however measured with atop, could you do a quick check with it as
> well?
> 
>> Currently, the 4 initial nodes are in datacenter A and the 3 other nodes
>> are, together with most of the VM's in datacenter B.
>>
> Distance between those?
> 
>> I am currently cleaning the 4 initial nodes by doing
>>
>> ceph osd reweight to peut a peut reducing the usage, to remove the osd's
>> completely from there and just keeping up the monitors.
>>
>> The complete cluster have to move to one single datacenter together with
>> all VM's.
>>
> OK, so the re-balancing traffic is what's killing you.
> 
>> ---
>>
>> I am reducing the number of nodes because out of administrative view,
>> its not very handy. I prefere extending the hardware power in terms of
>> CPU, RAM and HDD.
>>
> While understandable and certainly something I'm doing whenever possible
> as well, scaling up instead of out is a lot harder to get right with Ceph
> than other things.
> 
>> So the endcluster will look like:
>>
>> 3x OSD Nodes, each:
>>
>> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit Network, Adaptec HBA 1000-16e
>> to connect to external JBOD servers holding the cold storage HDD's.
>> Maybe ~ 24 drives in 2 or 3 TB SAS or SATA 7200 RPM's.
>>
>> I think SAS is, because of the reduces access times ( 4/5 ms vs. 10 ms )
>> very useful in a ceph environment. But then again, maybe with a cache
>> tier the impact/difference is not really that big.
>>
> This, like so many things, depends.
> If you're read-heavy and keep read-only objects out of your cache by using
> the readforward cache mode or aggressive read-recency settings, then the
> faster access times of SAS HDDs will help.
> Otherwise (write-heavy or having reads copied into the cache as well) not
> so much.
> 
>> That together with Samsung SM863 240 GB SSD's for journal and cache
>> tier, connected to the board directly or to a seperated Adaptec HBA
>> 1000-16i.
>>
> Out of the 5million objects (at 4MB each) you have, how many do you
> consider hot?
> It's a very difficult question if you're not aware of what is actually
> going on in your VMs (like do they write to the same status/DB files or
> are they creating "new" data and thus objects all the time?).
> But unless you can get at least close to that number in your cache pool,
> its usefulness will be diminished.
> 
> I had the luxury to deploy a cache pool of about 2.5TB capacity, nearly
> half the size of ALL the data in the cluster at that point, so I was very
> confident that it could hold the necessary objects. 
> As it turned out, only 5% of that capacity is actually needed to make
> things happy, but that's very specific to our use case (all the VMs run
> the same application).
> 
>> So far the current idea/theory/plan.
>>
>> ---
>>
>> But to that point, its a long road. Last night i was doing a reweight of
>> 3 OSD's from 1.0 to 0.9 ending up in one hdd was going down/out, so i
>> had to restart the osd. ( with again IO errors in some of the vm's ).
>>
>> So based on your article, the cache tier solved your problem, and i
>> think i have basically the same.
>>
> It can/will certainly mask the underlying problem, yes.
> 
> Christian
>> ---
>>
>> So a very good hint is, to activate the whole tier cache in the night,
>> when things are a bit more smooth.
>>
>> Any suggestions / critics / advices are highly welcome :-)
>>
>> Thank you!
>>
>>
> 
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



