Re,

>>>> I'd like some advice about the setup of a new Ceph cluster. Here is
>>>> the use case:
>>>>
>>>> RadosGW (S3 and maybe Swift for Hadoop/Spark) will be the main usage.
>>>> Most of the access will be in read-only mode. Write access will only
>>>> be done by the admin to update the datasets.
>>>>
>>>> We might use rbd from time to time to sync data as temp storage (when
>>>> POSIX is needed) but performance will not be an issue here. We might
>>>> use CephFS in the future if that can replace a filesystem on rbd.
>>>>
>>>> We are going to start with 16 nodes (up to 24). The configuration of
>>>> each node is:
>>>>
>>>> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t)
>>>> Memory : 128GB
>>>> OS Storage : 2 x SSD 240GB Intel S3500 DC (RAID 1)
>>>
>>> Dedicated OS SSDs aren't really needed, I tend to share OS and
>>> cache/journal SSDs.
>>> That's of course with more durable (S3610) models.
>>
>> I already have those 24 servers running 2 Ceph test clusters right now,
>> so I cannot change anything. We were thinking about sharing the
>> journals, but as I mention below, the MONs will be on the storage
>> servers, so sharing leveldb and journals on the same SSD might use too
>> much I/O.
>>
> Not really, the journal is sequential writes, the leveldb small, fast
> IOPS. Both of them on the same (decent) SSD should be fine.
>
> But as your HW is fixed, let's not speculate about that.

Ok.

>>> Since you didn't mention dedicated MON nodes, make sure that if you
>>> plan to put MONs on storage servers to have fast SSDs in them for the
>>> leveldb (again DC S36xx or 37xx).
>>
>> Yes, the MON nodes will be shared with the storage servers. The MONs
>> use the 240GB SSDs for the leveldb right now.
>>
> Note that the lowest IP(s) become the MON leader, so if you put RADOSGW
> and other things on the storage nodes as well, spread things out
> accordingly.

Yes, for sure, we are going to spread the services over the nodes. The 3
RadosGW instances won't be on the MON nodes.

>>> This will also free up 2 more slots in your (likely Supermicro)
>>> chassis for OSD HDDs.
>>
>> It's not a Supermicro enclosure, it's an Intel one with 12 slots 3.5"
>> in front and 2 slots 2.5" in the back, so I cannot add more disks. The
>> 240GB SSDs are in the front.
>
> That sounds like a SM chassis. ^o^
> In fact, I can't find a chassis on Intel's page with 2 back 2.5 slots.

http://www.colfax-intl.com/nd/images/systems/servers/R2208WT-rear.gif

>>>> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no RAID)
>>>
>>> These SSDs do not exist according to the Intel site and the only
>>> references I can find for them are on "no longer available" European
>>> sites.
>>
>> I made a mistake, it's not 400 but 480GB; smartctl gives me Model
>> SSDSC2BB480H4.
>>
> OK, that's not good.
> Firstly, that model number still doesn't get us any hits from Intel,
> strangely enough.
>
> Secondly, it is 480GB (instead of 400, which would denote
> overprovisioning) and matches the 3510 480GB model up to the last 2
> characters.
> And that has an endurance of 275TBW, not something you want to use for
> either journals or cache pools.

I see. Here is the information from the reseller: "The S3300 series is
the OEM version of S3510 and 1:1 the same drive".

>>> Since you're in the land of rich chocolate bankers, I assume that this
>>> model is something that just happened in Europe.
>>
>> I'm just a poor sysadmin with expensive toys in a University ;)
>>
> I know, I recognized the domain. ^.^

:)

>>> Without knowing the specifications for these SSDs, I can't recommend
>>> them. I'd use DC S3610 or 3710 instead, this very much depends on how
>>> much endurance (TBW) you need.
>>
>> As I wrote above, I already have those SSDs, so I am looking for the
>> best setup with the hardware I have.
>>
> Unless they have at least an endurance of 3 DWPD like the 361x (and
> their model number, size and the 3300 naming suggest they do NOT), your
> 480GB SSDs aren't suited for intense Ceph usage.
>
> How much have you used them yet and what is their smartctl status, in
> particular these values (from a 800GB DC S3610 in my cache pool):
> ---
> 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
> 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
> 241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 869293
> 242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 43435
> 243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1300884
> ---
>
> Not even 1% down after 40TBW, at which point your SSDs are likely to be
> 15% down...

More or less the same values on the 10 hosts I have in my beta cluster:

232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 233252
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 13
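
For what it's worth, if attribute 241 on these drives is also counted in
units of 32MiB (like the Host_Writes_32MiB counter above; smartctl just
seems to label it differently), a quick conversion would look roughly
like this (/dev/sdX is a placeholder for one of the 480GB SSDs):

---
# rough estimate of the total host writes on one SSD
RAW=$(smartctl -A /dev/sdX | awk '/Total_LBAs_Written|Host_Writes_32MiB/ {print $NF}')
echo "scale=2; $RAW * 32 / 1024 / 1024" | bc   # raw value * 32MiB -> TiB
---

With the value above, that is 233252 * 32MiB ~= 7.1TiB written so far,
only a few percent of the 275TBW rating.
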
>>>> OSD Disk : 10 x HGST Ultrastar 7K6000 6TB
>>>> Public Network : 1 x 10Gb/s
>>>> Private Network : 1 x 10Gb/s
>>>> OS : Ubuntu 16.04
>>>> Ceph version : Jewel
>>>>
>>>> The question is: journal or cache tier (read only) on the SSD 400GB
>>>> Intel S3300 DC?
>>>>
>>> You said read-only, or read-mostly up there.
>>
>> I mean, I am thinking about using the cache tier for read operations.
>> No write operations are going to use the cache tier. I don't know yet
>> which mode I am going to use, I have to do some tests.
>>
> As I said, your HDDs are unlikely to be slower (for sufficient parallel
> access, not short, sequential reads) than those SSDs.

Ok.

>>> So why journals (only helpful for writes) or cache tiers (your 2 SSDs
>>> may not be faster than your 10 HDDs for reads) at all?
>>
>> We plan to have heavy read access at times, so we are thinking about a
>> cache tier on SSD to speed up the throughput and decrease the I/O
>> pressure on the disks. I might be wrong on that.
>>
> Unless it is repetitive reads that fit entirely into the cache, probably
> not. Reads that need to be promoted to the cache are actually slower
> than direct ones.

Makes sense.

>>> Mind, if you have the money, go for it!
>>
>> I don't have the money, I have the hardware :)
>>
>>>> Each disk is able to write sequentially at 220MB/s. SSDs can write at
>>>> ~500MB/s. If we put 5 journals on each SSD, the SSDs will still be
>>>> the bottleneck (1GB/s vs 2GB/s).
>>>
>>> Your filestore based OSDs will never write Ceph data at 220MB/s, 100
>>> would be pushing it.
>>> So no, your journal SSDs won't be the limiting factor, though 5
>>> journals on one SSD is pushing my comfort zone when it comes to SPoFs.
>>>
>>>> If we put the journals on the OSDs themselves, can we expect good
>>>> read throughput from the disks (in case of data not in the cache),
>>>> and writes that aren't too bad either, even if we get random reads on
>>>> the OSD during the write?
>>>>
>>>> SSDs as a cache tier seem to be a better use than just 5 journals on
>>>> each? Is that correct?
>>>>
>>> Potentially, depends on your actual usage.
>>>
>>> Again, since you said read-mostly, the question with a cache-tier
>>> becomes, how much of your truly hot data can fit into it?
>>
>> That's the biggest point: many datasets will fit into the cache, but
>> some of them will definitely be too big (+100TB). In that case, our
>> users know what is going on.
>>
> With "correct" configuration of Jewel, you may be able to keep those
> huge datasets out of the cache altogether.

That would be great!

>>> Remember that super-hot objects are likely to come from the pagecache
>>> of the storage node in question anyway.
>>
>> Yes, I know that.
>>
>>> If you do care about fast writes after all, consider de-coupling
>>> writes and reads as much as possible.
>>
>> Write operations will only be done by the admins for dataset updates.
>> Those updates will be planned according to the usage of the cluster and
>> scheduled during low usage periods.
>
> Good, so the scheme below might work for you, at least the flushing of
> dirty data part.

Ok, that's exactly what I thought.

>>> As in, set your cache to "readforward" (undocumented, google for it),
>>> so all un-cached reads will go to the HDDs (they CAN read at near full
>>> speed), while all writes will go to the cache pool (and eventually to
>>> the HDDs, you can time that with lowering the dirty ratio during
>>> off-peak hours).
>>
>> I am going to take a look at that, thanks for the tip.
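
In the meantime, to be sure I understood the scheme, here is roughly
what I plan to test, written down as ceph CLI calls. The pool names are
placeholders and the values still have to be validated on our Jewel
cluster, so this is a sketch rather than a recipe:

---
# assuming the base pool (rgw-data) and the SSD pool (rgw-cache) exist,
# attach the SSD pool as a cache tier in readforward mode
# (some releases want --yes-i-really-mean-it for this cache mode)
ceph osd tier add rgw-data rgw-cache
ceph osd tier cache-mode rgw-cache readforward
ceph osd tier set-overlay rgw-data rgw-cache

# hit sets are needed by the tiering agent; the recency setting mostly
# matters if we end up testing writeback instead of readforward
ceph osd pool set rgw-cache hit_set_type bloom
ceph osd pool set rgw-cache hit_set_count 4
ceph osd pool set rgw-cache hit_set_period 1200
ceph osd pool set rgw-cache min_read_recency_for_promote 2

# cap the cache and keep flushing lazy during the day...
ceph osd pool set rgw-cache target_max_bytes $((4 * 1024**4))  # ~4TiB; adjust to usable SSD capacity
ceph osd pool set rgw-cache cache_target_dirty_ratio 0.4
ceph osd pool set rgw-cache cache_target_full_ratio 0.8
# ...and flush harder off-peak, e.g. from cron:
# ceph osd pool set rgw-cache cache_target_dirty_ratio 0.1
---
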
>>>> We are going to use an EC pool for big files (jerasure 8+2 I think)
>>>> and a replicated pool for small files.
>>>>
>>>> If I check on http://ceph.com/pgcalc/, in this use case:
>>>>
>>>> replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs
>>>> EC pool: pg_num = 4096
>>>> and pgp_num = pg_num
>>>>
>>>> Should I set the pg_num to 8192 or 16384? What is the impact on the
>>>> cluster if we set the pg_num to 16384 at the beginning? 16384 is
>>>> high, isn't it?
>>>>
>>> If 24 nodes is the absolute limit of your cluster, you want to set the
>>> target PGs per OSD to 100 in the calculator, which gives you 8192
>>> again.
>>>
>>> Keep in mind that splitting PGs is an expensive operation, so if 24
>>> isn't a hard upper limit, you might be better off starting big.
>>
>> Yes, I did some tests on that, it's definitely an expensive operation :)
>>
>> Thanks for that really useful answer.
>
> No worries, hopefully you can get some other SSDs for journals/cache
> pools.

I can't expect new SSDs right now, so I am going to have to discuss with
my colleagues what we can do.

Thanks.

-- 
Yoann Moulin
EPFL IC-IT

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com