Hello,

More line breaks, formatting.
A wall of text makes people less likely to read things.

On Fri, 2 Oct 2015 07:08:29 +0000 Javier C.A. wrote:

> Hello
> Before posting this message, I've been reading older posts in the
> mailing list, but I didn't get any clear answer.....

Define performance.
Many people seem to be fascinated by the speed of sequential (more or
less) writes and reads, while their use case would actually be better
served by increased small IOPS performance.

> I happen to have
> three servers available to test Ceph, and I would like to know if there
> is any kind of "performance prediction formula".

If there is such a thing (that actually works with less than a 10% error
margin), I'm sure RedHat would like to charge you for it. ^_-

> -My OSD servers are:
> - 1 x Intel E5-2603v3 1.6Ghz (6 cores)

Slightly underpowered, especially when it comes to small write IOPS.
My personal formula is at least 2GHz per OSD with SSD journal.

> - 32G RAM D4

OK, more is better (for read performance, see below).

> - 10Gb ethernet network, jumbo frames enabled

Slight overkill given the rest of your setup. I guess you saw all the fun
people keep having with jumbo frames in the ML archives.

> SSOO: 2 x 500GB RAID 1
> - OSD (6 OSD): - 2TB 7200 SATA4 6Gbps
> - 1 x SSD Intel SC3700 200GB for
> journaling of all 6 OSDs.

This means that the most throughput you'll ever be able to write to those
nodes is the speed of that SSD, 365MB/s, let's make that 350MB/s.
Thus the slight overkill comment earlier.
OTOH the HDDs get to use most of their IOPS (after discounting FS journals,
overhead, the OSD leveldb, etc).
So let's say slightly less than 100 IOPS per OSD.

> Replication factor = 2.

See below.

> - XFS

I find Ext4 faster, but that's me.

> -MON nodes
> will be running in other servers. With this OSD setup, how could I
> predict the cpeh cluster performace (IOPS, R/W BW, latency...)?

Of these, latency is the trickiest one, as so many things factor into it
aside from the network.
A test case where you're hitting basically just one OSD will look a lot
worse than an evenly spread out test (more threads over a sufficiently
large data set) would.
Userspace (librbd) results can/will differ vastly from kernel RBD clients.

IOPS is a totally worthless data point w/o clearly defining what you're
measuring and how.
Let's assume the "standard" of 4KB blocks and 32 threads, random writes.
Also let's assume a replication factor of 3, see below.

Sustained sync'ed (direct=1 option in fio) IOPS with your setup will be in
the 500 to 600 range (given a quiescent cluster).
This of course can change dramatically with non-direct writes and caching
(kernel page cache and/or RBD client caches).

The same is true for reads: if your data set fits into the page caches of
your storage nodes, it will be fast; if everything needs to be read from
the HDDs, you're back to what these devices can do (~100 IOPS per HDD).

To give you a concrete example, on my test cluster I have 5 nodes, 4
HDDs/OSDs each and no journal SSDs.
So that's in theory 100 IOPS per HDD, divided by 2 for the on-disk journal,
divided by 3 for replication:

20*100/2/3=333

Which amazingly is what I get with rados bench and 4K blocks; fio from a
kernel client with direct I/O is around 200.

BW, as in throughput, is easier: about 350MB/s max for sustained sequential
writes (the limit of the journal SSD) and let's say 750MB/s for sustained
reads.
Again, if you're reading just 8GB in your tests and that fits nicely in the
page caches of the OSDs, it will be wire speed.

> Should I configure a replica factor of 3?
>

If you value your data, which you will on a production server, then yes.
This will of course cost you 1/3 of your performance compared to replica 2.
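
For what it's worth, the back-of-envelope arithmetic above can be written
down as a small Python sketch. This is not a real prediction formula, just
the same rule of thumb in code. The function names and parameters are made
up for illustration. It assumes roughly 100 IOPS per HDD, that an on-disk
journal halves that, that all three of your servers are OSD nodes (18 OSDs
total), and that cluster write bandwidth is bounded by the journal SSDs
divided by the replication factor.

```python
def estimate_write_iops(num_osds, iops_per_hdd=100, replication=3,
                        journal_on_hdd=False):
    """Rough sustained small random write IOPS for the whole cluster.

    Rule of thumb only: every client write is stored `replication` times,
    and a journal on the same HDD doubles the writes hitting that disk.
    """
    journal_penalty = 2 if journal_on_hdd else 1
    return num_osds * iops_per_hdd / journal_penalty / replication


def estimate_write_mb_s(num_nodes, journal_mb_s=350, replication=3):
    """Rough sustained sequential write bandwidth in MB/s.

    Assumes one journal SSD per node is the bottleneck (an assumption,
    the network or the HDDs may limit you earlier).
    """
    return num_nodes * journal_mb_s / replication


# Test cluster from above: 5 nodes x 4 OSDs, journals on the HDDs.
test_cluster = estimate_write_iops(20, journal_on_hdd=True)

# Proposed setup: 3 nodes x 6 OSDs, SSD journals, replica 3 vs. replica 2.
proposed_r3 = estimate_write_iops(18, replication=3)
proposed_r2 = estimate_write_iops(18, replication=2)
proposed_bw = estimate_write_mb_s(3)

print(round(test_cluster))  # 333
print(round(proposed_r3))   # 600, upper end of the 500 to 600 range
print(round(proposed_r2))   # 900
print(round(proposed_bw))   # 350
```

Treat the results as a ceiling for a quiescent cluster. Caching, client
type and how evenly the load is spread will move the real numbers a lot.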
Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com