Hello,

>> I'd like some advice about the setup of a new ceph cluster. Here is the
>> use case:
>>
>> RadosGW (S3 and maybe swift for hadoop/spark) will be the main usage.
>> Most of the access will be in read-only mode. Write access will only be
>> done by the admin to update the datasets.
>>
>> We might use rbd some time to sync data as temp storage (when POSIX is
>> needed) but performance will not be an issue here. We might use cephfs
>> in the future if that can replace a filesystem on rbd.
>>
>> We are going to start with 16 nodes (up to 24). The configuration of
>> each node is:
>>
>> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t)
>> Memory : 128GB
>> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
>
> Dedicated OS SSDs aren't really needed, I tend to share OS and
> cache/journal SSDs.
> That's of course with more durable (S3610) models.

I already have those 24 servers running 2 ceph clusters for tests right
now, so I cannot change anything. We were thinking about sharing the
journal but, as I mention below, the MONs will be on the storage servers,
so sharing leveldb and journal on the same SSD might use too much I/O.

> Since you didn't mention dedicated MON nodes, make sure that if you plan
> to put MONs on storage servers to have fast SSDs in them for the leveldb
> (again DC S36xx or 37xx).

Yes, the MONs will be shared on the storage servers. They use the 240GB
SSDs for leveldb right now.

> This will also free up 2 more slots in your (likely Supermicro) chassis
> for OSD HDDs.

It's not a Supermicro enclosure, it's an Intel one with 12 x 3.5" slots in
front and 2 x 2.5" slots in the back, so I cannot add more disks. The
240GB SSDs are in front.

>> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
>
> These SSDs do not exist according to the Intel site and the only
> references I can find for them are on "no longer available" European sites.
I made a mistake, it's not 400GB but 480GB; smartctl gives me
Model SSDSC2BB480H4.

> Since you're in the land of rich chocolate bankers, I assume that this
> model is something that just happened in Europe.

I'm just a poor sysadmin with expensive toys in a university ;)

> Without knowing the specifications for these SSDs, I can't recommend them.
> I'd use DC S3610 or 3710 instead, this very much depends on how much
> endurance (TBW) you need.

As I wrote above, I already have those SSDs, so I'm looking for the best
setup with the hardware I have.

>> OSD Disk : 10 x HGST ultrastar-7k6000 6TB
>> Public Network : 1 x 10Gb/s
>> Private Network : 1 x 10Gb/s
>> OS : Ubuntu 16.04
>> Ceph version : Jewel
>>
>> The question is : journal or cache tier (read only) on the SSD 400GB
>> Intel S3300 DC ?
>>
> You said read-only, or read-mostly up there.

I mean, I'm thinking about using a cache tier for read operations. No
write operation is going to use the cache tier. I don't know yet which
mode I'm going to use, I have to do some tests.

> So why journals (only helpful for writes) or cache tiers (your 2 SSDs may
> not be faster than your 10 HDDs for reads) at all?

We plan to have heavy read access at times, so we are thinking about a
cache tier on SSD to speed up throughput and decrease the I/O pressure on
the disks. I might be wrong on that.

> Mind, if you have the money, go for it!

I don't have the money, I have the hardware :)

>> Each disk is able to write sequentially at 220MB/s. SSDs can write at
>> ~500MB/s. If we set 5 journals on each SSD, the SSDs will still be the
>> bottleneck (1GB/s vs 2GB/s).
>
> Your filestore based OSDs will never write Ceph data at 220MB/s, 100 would
> be pushing it.
> So no, your journal SSDs won't be the limiting factor, though 5 journals
> on one SSD is pushing my comfort zone when it comes to SPoFs.
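
For what it's worth, the bandwidth argument above can be written out in a
few lines of Python. This is only a back-of-the-envelope sketch: the
220MB/s, ~500MB/s and ~100MB/s figures come from this thread, while the
helper name and the idea of simply summing per-device rates are
illustrative assumptions, not a benchmark.

```python
# Rough per-node write-bandwidth comparison (all figures in MB/s).
# Device rates are the ones quoted in this thread; treating aggregate
# throughput as a plain sum is a simplifying assumption.

def aggregate_mb_s(devices: int, per_device_mb_s: float) -> float:
    """Hypothetical helper: aggregate throughput of identical devices."""
    return devices * per_device_mb_s

ssd_total = aggregate_mb_s(2, 500)        # 2 journal SSDs at ~500 MB/s
hdd_raw_total = aggregate_mb_s(10, 220)   # 10 HDDs at raw sequential speed
hdd_ceph_total = aggregate_mb_s(10, 100)  # 10 filestore OSDs at ~100 MB/s

print(f"journal SSDs     : {ssd_total:.0f} MB/s")
print(f"HDDs (raw)       : {hdd_raw_total:.0f} MB/s")
print(f"HDDs (filestore) : {hdd_ceph_total:.0f} MB/s")
print("SSDs limit raw HDD writes      :", ssd_total < hdd_raw_total)
print("SSDs limit filestore OSD writes:", ssd_total < hdd_ceph_total)
```

With the raw disk figure the two SSDs look like the bottleneck, but with
the ~100MB/s per-OSD estimate they roughly match the 10 OSDs.
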
>
>> If we set the journal on OSDs, we can
>> expect a good throughput in read on the disk (in case of data not in the
>> cache) and write shouldn't be so bad too, even if we have random read on
>> the OSD during the write ?
>>
>> SSDs as cache tier seem to be a better usage than only 5 journals on
>> each ? Is that correct ?
>>
> Potentially, depends on your actual usage.
>
> Again, since you said read-mostly, the question with a cache-tier becomes,
> how much of your truly hot data can fit into it?

That's the biggest point: many datasets will fit into the cache, but some
of them will definitely be too big (100TB+). In that case, our users know
what is going on.

> Remember that super-hot objects are likely to come from the pagecache of
> the storage node in question anyway.

Yes, I know that.

> If you do care about fast writes after all, consider de-coupling writes
> and reads as much as possible.

Write operations will only be done by the admins for dataset updates.
Those updates will be planned according to the usage of the cluster and
scheduled during low-usage periods.

> As in, set your cache to "readforward" (undocumented, google for it), so
> all un-cached reads will go to the HDDs (they CAN read at near full speed),
> while all writes will go to the cache pool (and eventually to the HDDs, you
> can time that with lowering the dirty ratio during off-peak hours).

I'm going to have a look at that, thanks for the tip.

>> We are going to use an EC pool for big files (jerasure 8+2 I think) and
>> a replicated pool for small files.
>>
>> If I check on http://ceph.com/pgcalc/, in this use case
>>
>> replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs
>> EC pool : pg_num = 4096
>> and pgp_num = pg_num
>>
>> Should I set the pg_num to 8192 or 16384 ? What is the impact on the
>> cluster if we set the pg_num to 16384 at the beginning ? 16384 is high,
>> isn't it ?
>>
> If 24 nodes is the absolute limit of your cluster, you want to set the
> target pg num to 100 in the calculator, which gives you 8192 again.
>
> Keep in mind that splitting PGs is an expensive operation, so if 24 isn't
> a hard upper limit, you might be better off starting big.

Yes, I did some tests on that, it's definitely an expensive operation :)

Thanks for that really useful answer.

-- 
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
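
As a postscript, the pg_num figures discussed in this thread can be
approximated with the small sketch below. It follows the rule of thumb
used by the PG calculator as far as I understand it: target PGs per OSD x
number of OSDs x data share / pool size, rounded to the nearest power of
two, and bumped to the next power of two if that would land more than 25%
below the raw value. The function name, a target of 200 PGs per OSD, a
replica size of 3 and a 100% data share per pool are assumptions picked
because they reproduce the numbers quoted above; they are not confirmed
settings.

```python
# Sketch of a pgcalc-style pg_num suggestion (assumed rule, see text above).

def suggest_pg_num(osds: int, target_per_osd: int, pool_size: int,
                   data_fraction: float = 1.0) -> int:
    """Hypothetical helper returning a power-of-two pg_num suggestion."""
    raw = osds * target_per_osd * data_fraction / pool_size
    # Never go below roughly one PG per OSD divided by the pool size.
    raw = max(raw, osds / pool_size, 1.0)
    lower = 1 << (int(raw).bit_length() - 1)  # largest power of two <= raw
    upper = lower * 2
    nearest = lower if (raw - lower) < (upper - raw) else upper
    # If the nearest power of two is more than 25% below raw, round up.
    if nearest < raw * 0.75:
        nearest = upper
    return nearest

# Replicated pool, size 3 (assumed), target 200 PGs per OSD (assumed).
print("160 OSDs, target 200:", suggest_pg_num(160, 200, 3))
print("240 OSDs, target 200:", suggest_pg_num(240, 200, 3))
# Same pool with the target lowered to 100, as suggested in the reply.
print("240 OSDs, target 100:", suggest_pg_num(240, 100, 3))
# EC 8+2 pool: size = k + m = 10.
print("EC 8+2, 160 OSDs    :", suggest_pg_num(160, 200, 10))
```

Under these assumptions the sketch gives 8192, 16384, 8192 and 4096, which
matches the values mentioned in the thread.
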