Hello,

On Mon, 11 Apr 2016 22:45:00 +0200 Oliver Dzombic wrote:

> Hi,
>
> currently in use:
>
> oldest:
>
> SSDs: Intel S3510 80GB

Ouch.
As in, not a speed wonder at 110MB/s writes (or 2 HDDs' worth), but at
least suitable as a journal when it comes to sync writes.
At 45TBW, however, they are dangerously low in the endurance department,
so I'd check their wear-out constantly!
See the recent thread:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28083.html

> HDD: HGST 6TB H3IKNAS600012872SE NAS

HGST should be fine.

> latest:
>
> SSDs: Kingston 120 GB SV300

Don't know them, so no idea if they are suitable when it comes to sync
writes, but at 64TBW they are also in danger of expiring rather quickly.

> HDDs: HGST 3TB H3IKNAS30003272SE NAS
>
> in future will be in use:
>
> SSDs: Samsung SM863 240 GB

Those should be suitable in both the sync write and endurance
departments, alas I haven't tested them myself.

> HDDs: HGST 3TB H3IKNAS30003272SE NAS and/or
> Seagate ST2000NM0023 2 TB
>
> -----
>
> It's hard to say if, and at what chance, the newer ones will fail with
> the OSDs getting down/out, compared to the old ones.
>
> We did a lot to avoid that.
>
> Without having it in real numbers, my feeling is/was that the newer
> ones will fail with a much lower chance. But what is responsible for
> that is unknown.
>
> In the very end, the old nodes with 2x 2.3 GHz Intel Celeron (2 cores
> without HT) and 3x 6 TB HDD have much less CPU power per HDD compared
> to the 4x 3.3 GHz Intel E3-1225v5 CPU (4 cores) with 10x 3 TB HDD.

Yes, I'd suspect CPU exhaustion mostly here, aside from the IO overload.
On my massively underpowered test cluster I've been able to create
OSD/MON failures from exhausting CPU or RAM, on my production clusters
never.

> So it's just too different: CPU, HDD, RAM, even the HDD controller.
>
> I will have to make sure that the new cluster has enough hardware so
> that I don't need to consider possible problems there.
>
> ------
>
> atop: sda/sdb == SSD journal

Since there are 12 disks, I presume those are the Kingston ones.
Frankly I wouldn't expect 10+ms waits from SSDs, but then again they are
90%-ish busy when doing only 500 IOPS and writing 1.5MB/s.
This indicates to me that they are NOT handling sync writes gracefully
and are not suitable as Ceph journals.

> ------
>
> That was my first experience too. At the very first, deep-scrubs and
> even normal scrubs were driving the %WA and busyness of the HDDs to
> 100% flat.
>
> ------
>
> I rechecked it with munin.
>
> The journal SSDs go from ~40% up to 80-90% during deep-scrub.

I have no explanation for this, as deep-scrubbing introduces no writes.

> The HDDs go from ~20% up to 90-100% flat, more or less, during
> deep-scrub.

That's to be expected; again, the scrub sleep factor can reduce the
impact on client I/O immensely.

Christian

> At the same time, the load average goes to 16-20 (4 cores),
> while the CPU will see up to 318% Idle Waiting Time (out of max. 400%).
>
> ------
>
> The OSDs receive a peer timeout. Which is just understandable if the
> system sees a 300% Idle Waiting Time for just long enough.
>
> ------
>
> And yes, as it seems, clusters which are very busy, especially with
> low hardware resources, need much more than the standard config
> can/will deliver. As soon as the LTS is out I will have to start
> busting my head with the available config parameters.
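
P.S.: Since you mention digging into the config parameters once the LTS
is out, the scrub and heartbeat knobs I'd look at first are along these
lines. This is only a sketch: the option names are real, but the values
are untested starting points for your kind of workload, not
recommendations, so verify them against the defaults of your version:

  [osd]
  # values below are starting points only, adjust to your workload
  # slow scrubbing down so client I/O on the spinners survives it
  osd scrub sleep = 0.1
  osd scrub chunk max = 5
  # confine scrubs to off-peak hours
  osd scrub begin hour = 1
  osd scrub end hour = 6
  # stretch deep-scrubs out to two weeks
  osd deep scrub interval = 1209600
  # more slack before peers report an overloaded OSD down (default 20)
  osd heartbeat grace = 40
  # keep recovery/backfill from piling onto already busy disks
  osd max backfills = 1
  osd recovery max active = 1

And to verify whether the Kingstons handle sync writes at all, the usual
quick check is a single-threaded direct sync write with fio, something
like this (the target file and runtime are just placeholders):

  fio --name=journal-test --filename=/path/to/testfile --size=1G \
      --direct=1 --sync=1 --bs=4k --iodepth=1 --numjobs=1 \
      --rw=write --runtime=60

A journal-worthy SSD sustains thousands of IOPS in that test, consumer
drives tend to collapse to a few hundred.

None of the tuning above creates IOPS or CPU cycles of course, it only
spreads the pain out.
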
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/