Hello,

On Thu, 7 Apr 2016 13:18:39 +0200 Oliver Dzombic wrote:

> Hi Christian,
>
> thank you for answering, I appreciate your time!
>

One thing that should be obvious, but which I forgot in the previous mail:
of course you should probably wait for 0.94.7 or Jewel before doing
anything with cache tiers.

> ---
>
> It's used for RBD-hosted VMs and also CephFS-hosted VMs.
>

I've been pondering CephFS for hosting Vserver- or LXC-based lightweight
VMs; what are you running on top of it?

> Well, the basic problem is/was that single OSDs simply go out/down.

Even at the worst of times I never had OSDs actually get dropped. I wonder
if there is something more than simple I/O overload going on here, such as
memory/CPU exhaustion or network issues. Any clues to that in the logs?
Also, is this simply fixed by restarting the OSD?

> Ending in SATA bus errors for the VMs, which have to be rebooted, if
> they can at all, because as long as OSDs are missing in that scenario
> the customers can't start their VMs.
>

This baffles me. I could see things hang for the time it takes to mark
that OSD down and out, as described here:
http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
But if things still hang after that, something is very wrong.

> Installing/checking Munin revealed a very high drive utilization, and
> with that simply an overload of the cluster.
>

Is there no discernible pattern of OSDs (or nodes) being more prone to
going down? Another thing to look into is the fragmentation of your OSDs.

> The initial setup was 4 nodes, with 4x mon, each with 3x 6 TB HDD and
> 1x SSD for journal.
>
> So I started to add more OSDs (2 nodes, with 3x 6 TB HDD and 1x SSD
> for journal) and, as first aid, reduced the replication from 3 to 2
> to lower the (write) load on the cluster.
>
> I planned to wait until the new LTS is out, but I have now already added
> another node with 10x 3 TB HDD, 2x SSD for journal and 2-3x SSD for the
> cache tier (changing strategy and increasing the number of drives while
> reducing their size - the original sizing was a design mistake on my
> part).
>

More HDDs (independent of their size) are better in terms of IOPS.

>      osdmap e31602: 28 osds: 28 up, 28 in
>             flags noscrub,nodeep-scrub
>       pgmap v13849513: 1428 pgs, 5 pools, 19418 GB data, 4932 kobjects
>             39270 GB used, 88290 GB / 124 TB avail
>                 1428 active+clean
>
> The range goes from 200 op/s to around 5000 op/s.
>
> The current average drive utilization is 20-30%.
>

Pretty similar to what I see on a cluster of mine that resembles yours
(24 OSDs with SSD journals on 4 hosts, replica 3) when I load it up to
5000 op/s. And that should be just fine.

> If we have backfill (OSD out/down) or a reweight, the utilization of
> the HDD drives is straight at 90-100%.
>

Not surprisingly, and you already got some hints from Chris Taylor on how
to reduce the impact.

That's probably also why you have scrubbing disabled; the reads (more than
the writes) create a lot of contention (seeking) on the HDD OSDs. I assume
you have played with the various settings to lessen the scrub impact, in
particular (see the example ceph.conf snippet further down):

osd_scrub_begin_hour
osd_scrub_end_hour
osd_scrub_sleep = 0.1
osd_scrub_load_threshold

> Munin shows on all drives (except the SSDs) a disk latency of 170 ms on
> average, a minimum of 80-130 ms and a maximum of 300-600 ms.
>

Really? That's insanely high; I can't make the crappy HDDs in my crappy
test cluster go over 24 ms even when they are at 107% utilization. That is
measured with atop, however, so could you do a quick check with it as
well?
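For a quick cross-check of those numbers, something along these lines
should do (the interval is arbitrary, pick whatever matches your Munin
sampling):

  atop 5
    (in the DSK lines, "avio" is the average time spent per I/O request)

  iostat -x 5
    (the "await" column for the OSD data disks tells a similar story)

If those roughly agree with Munin's 170 ms, the disks really are drowning
in seeks.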
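And since I keep mentioning the scrub knobs, here is roughly what I mean
in ceph.conf; the hours and values are only placeholders for an off-peak
window (if I remember right, the begin/end hour options appeared with
Hammer 0.94.x), so adjust them to your own quiet period:

  [osd]
  # only schedule new scrubs during the night
  osd scrub begin hour = 1
  osd scrub end hour = 7
  # sleep between scrub chunks to give client I/O some room to breathe
  osd scrub sleep = 0.1
  # don't start scrubs when the host load is already above this
  osd scrub load threshold = 0.5

These can also be changed at runtime with "ceph tell osd.* injectargs" if
you don't feel like restarting OSDs.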
> Currently, the 4 initial nodes are in datacenter A and the 3 other nodes
> are, together with most of the VMs, in datacenter B.
>

Distance between those?

> I am currently draining the 4 initial nodes by doing
>
> ceph osd reweight
>
> to reduce the usage bit by bit, in order to remove the OSDs completely
> from there and just keep the monitors up.
>
> The complete cluster has to move to one single datacenter together with
> all the VMs.
>

OK, so the re-balancing traffic is what's killing you.

> ---
>
> I am reducing the number of nodes because, from an administrative point
> of view, it's not very handy. I prefer extending the hardware power in
> terms of CPU, RAM and HDD.
>

While understandable, and certainly something I'm doing whenever possible
as well, scaling up instead of out is a lot harder to get right with Ceph
than with other things.

> So the final cluster will look like:
>
> 3x OSD nodes, each:
>
> 2x E5-2620v3 CPU, 128 GB RAM, 2x 10 Gbit network, Adaptec HBA 1000-16e
> to connect to external JBOD servers holding the cold-storage HDDs.
> Maybe ~24 drives of 2 or 3 TB, SAS or SATA, 7200 RPM.
>
> I think SAS is, because of the reduced access times (4-5 ms vs. 10 ms),
> very useful in a Ceph environment. But then again, maybe with a cache
> tier the impact/difference is not really that big.
>

This, like so many things, depends. If you're read-heavy and keep
read-only objects out of your cache by using the readforward cache mode or
aggressive read-recency settings, then the faster access times of SAS HDDs
will help. Otherwise (write-heavy, or having reads copied into the cache
as well) not so much.

> That together with Samsung SM863 240 GB SSDs for journal and cache
> tier, connected to the board directly or to a separate Adaptec HBA
> 1000-16i.
>

Out of the 5 million objects (at 4MB each) you have, how many do you
consider hot? That's a very difficult question to answer if you're not
aware of what is actually going on in your VMs (do they write to the same
status/DB files, or do they create "new" data and thus new objects all the
time?). But unless you can get at least close to that number in your cache
pool, its usefulness will be diminished.

I had the luxury of deploying a cache pool of about 2.5 TB capacity,
nearly half the size of ALL the data in the cluster at that point, so I
was very confident that it could hold the necessary objects. As it turned
out, only 5% of that capacity is actually needed to keep things happy, but
that's very specific to our use case (all the VMs run the same
application).

> So far for the current idea/theory/plan.
>
> ---
>
> But it's a long road to that point. Last night I was doing a reweight
> of 3 OSDs from 1.0 to 0.9, which ended with one HDD going down/out, so
> I had to restart the OSD (again with I/O errors in some of the VMs).
>
> So based on your article, the cache tier solved your problem, and I
> think I have basically the same one.
>

It can/will certainly mask the underlying problem, yes.

Christian

> ---
>
> So a very good hint is to activate the whole cache tier at night,
> when things are a bit smoother.
>
> Any suggestions / criticism / advice is highly welcome :-)
>
> Thank you!
>

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com