Hello,

On Wed, 07 Oct 2015 07:34:16 +0200 Loic Dachary wrote:

> Hi Christian,
>
> Interesting use case :-) How many OSDs / hosts do you have ? And how are
> they connected together ?
>

If you look far back in the archives you'll find that design. And of course
there will be a lot of "I told you so" comments, but it worked just as
planned while staying within the design specifications.

For example, one of the first things I did was to have 64 VMs install
themselves automatically from a virtual CD-ROM in parallel. This Ceph
cluster handled that w/o any slow requests and in decent time.

To answer your question: just 2 nodes with 2 OSDs (RAID6 behind an Areca
controller with a 4GB cache) each, replication of 2 obviously. Initially 3,
now 6 compute nodes. All interconnected via redundant 40Gb/s Infiniband
(IPoIB), 2 ports per server and 2 switches.

While the low number of OSDs is obviously part of the problem here, it is
masked by the journal SSDs and the large HW cache for the steady state.

My revised design is 6 RAID10 OSDs per node; the change to RAID10 is mostly
to accommodate the type of VMs this cluster wasn't designed for in the
first place.

My main suspects for the excessive slowness are actually the Toshiba DT
type drives used. We only found out after deployment that these can go
into a zombie mode (20% of their usual performance, for ~8 hours if not
permanently until power cycled) after a week of uptime. Again, the HW
cache is likely masking this for the steady state, but asking a sick DT
drive to seek (for reads) is just asking for trouble.

To illustrate this:
---
DSK | sdd | busy  86% | read 0 | write  99 | avio 43.6 ms |
DSK | sda | busy  12% | read 0 | write 151 | avio 4.13 ms |
DSK | sdc | busy   8% | read 0 | write 139 | avio 2.82 ms |
DSK | sdb | busy   7% | read 0 | write 132 | avio 2.70 ms |
---
The above is a snippet from atop on another machine here; the 4 disks are
in a RAID10. I'm sure you can guess which one is the DT01ACA200 drive;
sdb and sdc are Hitachi HDS723020BLA642 and sda is a Toshiba MG03ACA200.

I have another production cluster that originally had just 3 nodes with 8
OSDs each. It performed much better with MG drives. So the new node I'm
trying to phase in has these MG HDDs and the older ones will be replaced
eventually.

Christian

[snip]

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
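
P.S.: In case it's useful to anybody, here is a rough sketch (illustrative
only, not something we run in production) of how one could watch for a drive
dropping into that state without sitting in front of atop: sample
/proc/diskstats twice and compare per-disk busy time and average time per
request, roughly the "busy" and "avio" columns above. The sd* filter and the
10 ms warning threshold are arbitrary picks, adjust to taste.

---
#!/usr/bin/env python3
# Rough sketch: sample /proc/diskstats twice and print per-disk busy %
# and average ms per request, roughly the "busy" and "avio" columns of
# the atop snippet above.
# Field layout per the kernel's Documentation/iostats.txt:
#   [3] reads completed, [7] writes completed, [12] ms spent doing I/O
# The sd* filter and the 10 ms threshold are illustrative only.

import time

INTERVAL = 10          # sampling window in seconds
AVIO_WARN_MS = 10.0    # flag drives averaging more than this per request

def sample():
    disks = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # whole sd? disks only; skip partitions (sda1), dm-*, loop*, ...
            if not name.startswith("sd") or name[-1].isdigit():
                continue
            ios = int(fields[3]) + int(fields[7])  # reads + writes completed
            busy_ms = int(fields[12])              # total time spent doing I/O
            disks[name] = (ios, busy_ms)
    return disks

before = sample()
time.sleep(INTERVAL)
after = sample()

for name in sorted(after):
    if name not in before:
        continue
    d_ios = after[name][0] - before[name][0]
    d_busy = after[name][1] - before[name][1]
    busy_pct = 100.0 * d_busy / (INTERVAL * 1000.0)
    avio_ms = d_busy / d_ios if d_ios else 0.0
    flag = "  <-- suspect" if avio_ms > AVIO_WARN_MS else ""
    print("DSK | %-4s | busy %3.0f%% | ios %6d | avio %5.2f ms |%s"
          % (name, busy_pct, d_ios, avio_ms, flag))
---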