Re: journal or cache tier on SSDs ?

Hello,

On Tue, 10 May 2016 10:40:08 +0200 Yoann Moulin wrote:

> Hello,
> 
> I'd like some advices about the setup of a new ceph cluster. Here the
> use case :
> 
> RadosGW (S3 and maybe swift for hadoop/spark) will be the main usage.
> Most of the access will be in read only mode. Write access will only be
> done by the admin to update the datasets.
> 
> We might use rbd some time to sync data as temp storage (when POSIX is
> needed) but performance will not be an issue here. We might use cephfs
> in the future if that can replace a filesystem on rbd.
> 
> We gonna start with 16 nodes (up to 24). The configuration of each node
> is :
> 
> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t)
> Memory : 128GB
> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)

Dedicated OS SSDs aren't really needed; I tend to share OS and
cache/journal duties on the same SSDs.
That's of course with more durable (S3610) models.
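
For example (sizes purely illustrative): on each of the 2 SSDs a ~30GB
partition for the OS in a md RAID1, plus 5 x 10GB journal partitions,
leaving the rest unprovisioned as extra spare area.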

Since you didn't mention dedicated MON nodes: if you plan to put MONs on
the storage servers, make sure those have fast SSDs for the leveldb
(again, DC S36xx or S37xx).

This will also free up 2 more slots in your (likely Supermicro) chassis
for OSD HDDs.

> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)

These SSDs do not exist according to the Intel site, and the only
references I can find for them are on "no longer available" European sites.
Since you're in the land of rich chocolate bankers, I assume this model
was something only ever sold in Europe.

Without knowing the specifications of these SSDs, I can't recommend them.
I'd use DC S3610 or S3710 instead; which one very much depends on how
much endurance (TBW) you need.
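
For a rough feel of the numbers (from memory, check the datasheets): a
400GB S3610 at ~3 drive writes per day is ~1.2TB/day or ~2.2PB over 5
years, a 400GB S3710 at ~10 DWPD is ~4TB/day or ~7PB. Every byte your
clients write to the 5 OSDs behind a journal SSD also goes through that
SSD, so compare those figures against your expected daily ingest.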

> OSD Disk : 10 x HGST ultrastar-7k6000 6TB
> Public Network : 1 x 10Gb/s
> Private Network : 1 x 10Gb/s
> OS : Ubuntu 16.04
> Ceph version : Jewel
> 
> The question is : journal or cache tier (read only) on the SSD 400GB
> Intel S3300 DC ?
> 
You said read-only, or read-mostly up there. 

So why journals (only helpful for writes) or cache tiers (your 2 SSDs may
not be faster than your 10 HDDs for reads) at all?

Mind, if you have the money, go for it!

> Each disk is able to write sequentially at 220MB/s. SSDs can write at
> ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the
> bottleneck (1GB/s vs 2GB/s). 

Your filestore-based OSDs will never write Ceph data at 220MB/s; 100MB/s
would be pushing it.
So no, your journal SSDs won't be the limiting factor, though 5 journals
on one SSD is pushing my comfort zone when it comes to SPoFs.
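
Back of the envelope with the numbers above: 10 filestore OSDs at a
realistic ~100MB/s each is about 1GB/s of writes per node; split over 2
journal SSDs that's ~500MB/s each, right at their sequential limit. And
that's the best case, which you're unlikely to sustain.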

> If we set the journal on OSDs, we can
> expect a good throughput in read on the disk (in case of data not in the
> cache) and write shouldn't be so bad too, even if we have random read on
> the OSD during the write ?
> 
> SSDs as cache tier seem to be a better usage than only 5 journal on
> each ? Is that correct ?
> 
Potentially, depends on your actual usage.

Again, since you said read-mostly, the question with a cache-tier becomes,
how much of your truly hot data can fit into it?

Remember that super-hot objects are likely to come from the pagecache of
the storage node in question anyway.
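
As a rough sizing figure (assuming a size-3 replicated cache pool): 16
nodes x 2 x 400GB is 12.8TB raw, so about 4.2TB of usable cache. If your
hot working set is bigger than that, the tier will mostly churn.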

If you do care about fast writes after all, consider de-coupling writes
and reads as much as possible.
As in, set your cache mode to "readforward" (undocumented, google for it),
so all un-cached reads will go to the HDDs (they CAN read at near full
speed), while all writes will go to the cache pool (and eventually to the
HDDs; you can time that by lowering the dirty ratio during off-peak hours).
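
If you go down that road, the setup looks roughly like this (pool names
are made up, double-check the knobs against the Jewel docs; newer
releases may want --yes-i-really-mean-it for the non-default cache modes):

  ceph osd tier add ec-data cache-ssd
  ceph osd tier cache-mode cache-ssd readforward
  ceph osd tier set-overlay ec-data cache-ssd
  ceph osd pool set cache-ssd hit_set_type bloom
  ceph osd pool set cache-ssd target_max_bytes 4000000000000  # ~4TB, size to your usable SSD space
  ceph osd pool set cache-ssd cache_target_dirty_ratio 0.6    # lower off-peak to flush
  ceph osd pool set cache-ssd cache_target_full_ratio 0.8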

> We gonna use an EC pool for big files (jerasure 8+2 I think) and a
> replicated pool for small files.
> 
> If I check on http://ceph.com/pgcalc/, in this use case
> 
> replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs
> Ec pool : pg_num = 4096
> and pgp_num = pg_num
> 
> Should I set the pg_num to 8192 or 16384 ? what is the impact on the
> cluster if we set the pg_num to 16384 at the beginning ? 16384 is high,
> isn't it ?
> 
If 24 nodes is the absolute limit of your cluster, you want to set the
target PGs per OSD to 100 in the calculator, which gives you 8192 again.

Keep in mind that splitting PGs is an expensive operation, so if 24 isn't
a hard upper limit, you might be better off starting big.
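
Back of the envelope (the calculator additionally weights this by the
expected %data per pool, so treat these as upper bounds):

  replicated pool: 240 OSDs * 100 PGs/OSD / 3 (size)     = 8000 -> 8192
  EC pool:         240 OSDs * 100 PGs/OSD / 10 (k+m=8+2) = 2400 -> 4096

rounded up to the next power of 2, and keeping the total PGs per OSD
across all pools in the 100-200 range.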

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


