Re: journal or cache tier on SSDs ?

Hello,

On Tue, 10 May 2016 13:14:35 +0200 Yoann Moulin wrote:

> Hello,
> 
> >> I'd like some advice about the setup of a new ceph cluster. Here is the
> >> use case:
> >>
> >> RadosGW (S3 and maybe Swift for hadoop/spark) will be the main usage.
> >> Most of the access will be in read only mode. Write access will only
> >> be done by the admin to update the datasets.
> >>
> >> We might use rbd at times to sync data as temp storage (when POSIX is
> >> needed), but performance will not be an issue there. We might use cephfs
> >> in the future if that can replace a filesystem on rbd.
> >>
> >> We're going to start with 16 nodes (up to 24). The configuration of each
> >> node is:
> >>
> >> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t)
> >> Memory : 128GB
> >> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
> > 
> > Dedicated OS SSDs aren't really needed, I tend to share OS and
> > cache/journal SSDs.
> > That's of course with more durable (S3610) models.
> 
> I already have those 24 servers running 2 Ceph clusters for testing right
> now, so I cannot change anything. We were thinking about sharing the
> journals, but as I mention below, the MONs will be on the storage servers,
> so that might use too much I/O to put leveldb and journal on the same SSD.
>
Not really: the journal is sequential writes, while the leveldb is small,
fast IOPS. Both of them on the same (decent) SSD should be fine.

But as your HW is fixed, let's not speculate about that.
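If you do want to verify rather than speculate, the usual quick check for
journal suitability is a direct, synced write test with fio (destructive,
so only run it against an unused device or partition; /dev/sdX below is
just a placeholder):
---
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
---
An SSD that does well on that won't be bothered by the leveldb sitting on
another partition.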
 
> > Since you didn't mention dedicated MON nodes, make sure that if you
> > plan to put MONs on storage servers to have fast SSDs in them for the
> > leveldb (again DC S36xx or 37xx).
> 
> Yes, the MON nodes will be shared with the storage servers. The MONs use
> the 240GB SSDs for the leveldb right now.
> 
Note that the MON with the lowest IP becomes the leader, so if you put RADOSGW
and other things on the storage nodes as well, spread things out
accordingly.
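(If you want to see who the leader currently is, "ceph quorum_status" will
tell you, e.g.:
---
ceph quorum_status --format json-pretty | grep quorum_leader_name
---
the exact JSON layout may vary a bit between releases, but the leader name
is in there.)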

> > This will also free up 2 more slots in your (likely Supermicro) chassis
> > for OSD HDDs.
> 
> It's not a Supermicro enclosure, it's an Intel one with 12 3.5" slots in
> the front and 2 2.5" slots in the back, so I cannot add more disks. The
> 240GB SSDs are in the front.
>
That sounds like a SM chassis. ^o^
In fact, I can't find a chassis on Intel's page with 2 rear 2.5" slots.
 
> >> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
> > 
> > These SSDs do not exist according to the Intel site and the only
> > references I can find for them are on "no longer available" European
> > sites.
> 
> I made a mistake, it's not 400GB but 480GB; smartctl gives me Model
> SSDSC2BB480H4
>
OK, that's not good.
Firstly, that model number still doesn't get us any hits from Intel,
strangely enough.

Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and
matches the 3510 480GB model up to the last 2 characters.
And that has an endurance of 275TBW, not something you want to use for
either journals or cache pools.
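
To put that into perspective, rough math assuming a 5 year lifetime:
275TBW / (5 * 365 days) ~= 150GB/day per SSD. With 5 filestore journals on
each SSD, every client write to those 5 OSDs passes through it first, so
~150GB/day of writes (plus backfill/recovery traffic) is enough to use up
the rated endurance right on schedule.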
 
> > Since you're in the land of rich chocolate bankers, I assume that this
> > model is something that just happened in Europe.
> 
> I'm just a poor sysadmin with expensive toys in a university ;)
> 
I know, I recognized the domain. ^.^

> > Without knowing the specifications for these SSDs, I can't recommend
> > them. I'd use DC S3610 or 3710 instead; this very much depends on how
> > much endurance (TBW) you need.
> 
> As I wrote above, I already have those SSDs, so I'm looking for the best
> setup with the hardware I have.
> 

Unless they have at least an endurance of 3 DWPD like the 361x (and their
model number, size and the 3300 naming suggest they do NOT), your 480GB
SSDs aren't suited for intense Ceph usage.

How much have you used them so far, and what is their smartctl status, in
particular these values (from an 800GB DC S3610 in my cache pool):
---
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       869293
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       43435
243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1300884
---

Not even 1% down after 40TBW, at which point your SSDs are likely to be
15% down...
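
You can pull the same numbers off your drives with something like the
following (assuming the S3300s report the same attribute names as the
other DC models, which I'd expect but haven't verified):
---
smartctl -A /dev/sdX | egrep 'Wearout|Host_Writes|NAND_Writes|Reservd'
---
Multiply the *_32MiB counters by 32MiB to get actual bytes written.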


> >> OSD Disk : 10 x HGST ultrastar-7k6000 6TB
> >> Public Network : 1 x 10Gb/s
> >> Private Network : 1 x 10Gb/s
> >> OS : Ubuntu 16.04
> >> Ceph version : Jewel
> >>
> >> The question is: journal or cache tier (read only) on the SSD 400GB
> >> Intel S3300 DC ?
> >>
> > You said read-only, or read-mostly up there. 
> 
> I mean, I'm thinking about using the cache tier for read operations. No
> write operations are going to use the cache tier. I don't know yet which
> mode I'm going to use; I have to do some tests.
> 
As I said, your HDDs are unlikely to be slower than those SSDs (given
sufficiently parallel access, as opposed to short, sequential reads).
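
That's easy enough to check empirically, by the way. Something along these
lines against a throwaway pool (pool name and runtimes are examples only):
---
rados bench -p testpool 60 write --no-cleanup
rados bench -p testpool 60 seq
rados -p testpool cleanup
---
Run it once against an HDD-backed pool and once against an SSD-backed one
and compare the read numbers.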
 
> > So why journals (only helpful for writes) or cache tiers (your 2 SSDs
> > may not be faster than your 10 HDDs for reads) at all?
> 
> We plan to have heavy read access at times, so we're thinking about a
> cache tier on SSD to speed up throughput and decrease the I/O pressure
> on the disks. I might be wrong about that.
>
Unless it is repetitive reads that fit entirely into the cache, probably not.
Reads that need to be promoted to the cache are actually slower than
direct ones.
 
> > Mind, if you have the money, go for it!
> 
> I don't have the money, I have the hardware :)
> 
> >> Each disk is able to write sequentially at 220MB/s. SSDs can write at
> >> ~500MB/s. If we put 5 journals on each SSD, the SSDs will still be the
> >> bottleneck (1GB/s vs 2GB/s).
> > 
> > Your filestore-based OSDs will never write Ceph data at 220MB/s; 100
> > would be pushing it.
> > So no, your journal SSDs won't be the limiting factor, though 5
> > journals on one SSD is pushing my comfort zone when it comes to SPoFs. 
> > 
> >> If we set the journals on the OSDs, can we
> >> expect good read throughput from the disks (in case the data is not in
> >> the cache)? And writes shouldn't be too bad either, even if we have
> >> random reads on the OSDs during the writes?
> >>
> >> SSDs as a cache tier seem to be a better use than just 5 journals on
> >> each. Is that correct?
> >>
> > Potentially, depends on your actual usage.
> > 
> > Again, since you said read-mostly, the question with a cache-tier
> > becomes, how much of your truly hot data can fit into it?
> 
> That's the biggest point: many datasets will fit into the cache, but some
> of them will definitely be too big (100TB+), but in that case our users
> know what is going on.
>

With "correct" configuration of Jewel, you may be able to keep those huge
datasets out of the cache altogether. 
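
The knobs I have in mind are the recency requirements and the promotion
throttles that came with Jewel, along these lines (pool name and values
are examples only, tune to taste):
---
ceph osd pool set cache-pool min_read_recency_for_promote 2
ceph osd pool set cache-pool min_write_recency_for_promote 1
ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 4194304'
---
Roughly, with a read recency of 2 an object needs to have been seen in the
last couple of hit set intervals before it gets promoted, so a single
streaming pass over a 100TB dataset shouldn't drag the whole thing into
the cache.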
 
> > Remember that super-hot objects are likely to come from the pagecache
> > of the storage node in question anyway.
> 
> Yes I know that.
> 
> > If you do care about fast writes after all, consider de-coupling writes
> > and reads as much as possible.
> 
> Write operations will only be done by the admins for dataset updates.
> Those updates will be planned according to the usage of the cluster and
> scheduled during low-usage periods.
> 

Good, so the scheme below might work for you, at least the flushing of
dirty data part.

> > As in, set your cache to "readforward" (undocumented, google for it),
> > so all un-cached reads will go to the HDDs (they CAN read at near full
> > speed), while all writes will go to the cache pool (and eventually to
> > the HDDs; you can time that by lowering the dirty ratio during off-peak
> > hours).
> 
> I'm going to take a look at that, thanks for the tips.
> 
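For reference, the commands would be along these lines (pool name is a
placeholder, and IIRC Jewel wants the override flag for readforward since
it isn't an officially supported mode):
---
ceph osd tier cache-mode cache-pool readforward --yes-i-really-mean-it
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
ceph osd pool set cache-pool cache_target_dirty_ratio 0.1  # off-peak, to flush
---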
> >> We're going to use an EC pool for big files (jerasure 8+2 I think) and a
> >> replicated pool for small files.
> >>
> >> If I check on http://ceph.com/pgcalc/, in this use case
> >>
> >> replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs
> >> EC pool : pg_num = 4096
> >> and pgp_num = pg_num
> >>
> >> Should I set the pg_num to 8192 or 16384? What is the impact on the
> >> cluster if we set the pg_num to 16384 at the beginning? 16384 is
> >> high, isn't it?
> >>
> > If 24 nodes is the absolute limit of your cluster, you want to set the
> > target pg num to 100 in the calculator, which gives you 8192 again.
> > 
> > Keep in mind that splitting PGs is an expensive operation, so if 24
> > isn't a hard upper limit, you might be better off starting big.
> 
> Yes, I did some tests on that, it's definitely an expensive operation :)
> 
> Thanks for that really useful answer
> 
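
For completeness, the math behind those numbers is roughly:
(240 OSDs * 100 target PGs per OSD) / 3 replicas = 8000, rounded up to the
next power of two = 8192. With the default target of 200 PGs per OSD you
would land on 16384 instead, which is why lowering the target matters if
24 nodes really is your ceiling.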

No worries, hopefully you can get some other SSDs for journals/cache pools.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


