> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Willem Jan Withagen
> Sent: 26 June 2017 14:35
> To: Christian Wuerdig <christian.wuerdig@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Ceph random read IOPS
>
> On 26-6-2017 09:01, Christian Wuerdig wrote:
> > Well, preferring faster-clocked CPUs for SSD scenarios has been floated
> > several times over the last few months on this list. And realistic or
> > not, Nick's and Kostas' setups are similar enough (testing a single disk)
> > that it's a distinct possibility.
> > Anyway, as mentioned, measuring the performance counters would probably
> > provide more insight.
>
> I read the advice as: prefer GHz over cores.
>
> And since there is a trade-off between GHz and cores, that can be an
> expensive preference: getting both means paying substantially more money.
>
> And for an average Ceph server with plenty of OSDs, I personally just don't
> buy that. There you'd have to look at the total throughput of the system,
> and latency is only one of many factors.
>
> Let alone in a cluster with several hosts (and/or racks). There the latency
> is dictated by the network, so a bad choice of network card or switch will
> outweigh any extra cycles that your CPU can burn.
>
> I think that testing just 1 OSD is testing artifacts, and has very little
> to do with running an actual Ceph cluster.
>
> So if one would like to test this, the test setup should be something like:
> 3 hosts with something like 3 disks per host, min_disk=2 and a nice
> workload. Then turn the GHz knob and see what happens with client latency
> and throughput.

Did similar tests last summer: 5 nodes with 12x 7.2k disks each, connected
via 10G, NVMe journals, 3x replica pool.

The first test was with C-states left on auto and frequency scaling leaving
the cores at the lowest frequency of 900MHz. The cluster will quite happily
do a couple of thousand IOs without generating enough CPU load to push the 4
cores out of their power-saving C-states or up to max frequency.

With a small amount of background IO going on, a QD=1 sequential 4kB write
gave the following results:

  write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
    slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
    clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
     lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
    clat percentiles (usec):
     |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
     | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
     | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
     | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
     | 99.99th=[22400]

So just under 2.5ms write latency.

I don't have the results from changing C-states and frequency scaling
separately, but adjusting either got me a boost. Forcing C1 and a max
frequency of 3.6GHz got me:

  write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
    slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
    clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
     lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
    clat percentiles (usec):
     |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
     | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
     | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
     | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
     | 99.99th=[24448]

Quite a bit faster.
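For reference, nothing Ceph-specific is needed for the forcing; something
along the lines below pins the frequency, keeps the cores in shallow
C-states, and approximates the QD=1 write test. This is only a sketch: the
device path, runtime and the /dev/cpu_dma_latency approach are assumptions,
not the exact commands used here.

  # pin every core to the performance governor (max frequency)
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done

  # request ~0us wakeup latency via PM QoS, which keeps the CPUs out of the
  # deeper C-states; the request only lasts while the file descriptor is
  # open, so fd 3 is held open for the duration of the test
  exec 3> /dev/cpu_dma_latency
  echo -n 0 >&3

  # QD=1 sequential 4kB writes against a mapped RBD device
  # (/dev/rbd0 and the 60s runtime are placeholders)
  fio --name=seq-4k-write --filename=/dev/rbd0 --rw=write --bs=4k \
      --iodepth=1 --numjobs=1 --direct=1 --ioengine=libaio \
      --time_based --runtime=60

Closing fd 3 (or exiting the shell) drops the latency request again, so the
cores fall back to their normal C-state behaviour after the run.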
These are best-case figures, though; with any substantial workload running,
the average tends to hover around 1ms latency.

Nick

> --WjW
>
> > On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx>
> > wrote:
> >
> > > On 24 Jun 2017, at 14:17, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> > >
> > > > My understanding was that this test targets latency more than IOPS.
> > > > This is probably why it was run using QD=1. It also makes sense that
> > > > CPU frequency will be more important than cores.
> > >
> > > But then it is not generic enough to be used as advice!
> > > It is just a line in 3D space, as there are so many other variables.
> > >
> > > --WjW
> > >
> > > > On 2017-06-24 12:52, Willem Jan Withagen wrote:
> > > >
> > > > > On 24-6-2017 05:30, Christian Wuerdig wrote:
> > > > > > The general advice floating around is that you want CPUs with
> > > > > > high clock speeds rather than more cores to reduce latency and
> > > > > > increase IOPS for SSD setups (see also
> > > > > > http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/).
> > > > > > So something like an E5-2667V4 might bring better results in
> > > > > > that situation. Also there was some talk about disabling the
> > > > > > processor C-states in order to bring latency down (something
> > > > > > like this should be easy to test:
> > > > > > https://stackoverflow.com/a/22482722/220986)
> > > > >
> > > > > I would be very careful to call this general advice...
> > > > >
> > > > > Although the article is interesting, it is rather one-sided.
> > > > >
> > > > > The only thing it shows is that there is a linear relation between
> > > > > clock speed and write or read speeds. The article is rather vague
> > > > > on how and what is actually tested.
> > > > >
> > > > > By just running a single OSD with no replication, a lot of the
> > > > > functionality is left out of the equation. Nobody runs just 1 OSD
> > > > > on a box in a normal cluster host.
> > > > >
> > > > > Not using a serious SSD is another source of noise in the
> > > > > conclusion. More queue depth can/will certainly have an impact on
> > > > > concurrency.
> > > > >
> > > > > I would call this an observation, and nothing more.
> > > > >
> > > > > --WjW
> > > > >
> > > > > > On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> > > > > > <reverend.x3@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > We are in the process of evaluating the performance of a
> > > > > > > testing cluster (3 nodes) with Ceph Jewel. Our setup consists
> > > > > > > of:
> > > > > > > 3 monitors (VMs)
> > > > > > > 2 physical servers, each connected to 1 JBOD, running Ubuntu
> > > > > > > Server 16.04
> > > > > > >
> > > > > > > Each server has 32 threads @2.1GHz and 128GB RAM.
> > > > > > > The disk distribution per server is:
> > > > > > > 38 * HUS726020ALS210 (SAS rotational)
> > > > > > > 2 * HUSMH8010BSS200 (SAS SSD for journals)
> > > > > > > 2 * ST1920FM0043 (SAS SSD for data)
> > > > > > > 1 * INTEL SSDPEDME012T4 (NVMe, measured with fio at ~300K IOPS)
> > > > > > >
> > > > > > > Since we don't currently have a 10Gbit switch, we test the
> > > > > > > performance with the cluster in a degraded state, the noout
> > > > > > > flag set, and we mount rbd images on the powered-on OSD node.
> > > > > > > We confirmed that the network is not saturated during the
> > > > > > > tests.
> > > > > > > We ran tests on the NVMe disk and the pool created on this
> > > > > > > disk, where we hoped to get the most performance without being
> > > > > > > limited by the hardware specs, since we have more disks than
> > > > > > > CPU threads.
> > > > > > >
> > > > > > > The NVMe disk was at first set up with one data partition and
> > > > > > > the journal on the same disk. Performance on random 4K reads
> > > > > > > topped out at 50K IOPS. We then removed the OSD and
> > > > > > > repartitioned with 4 data partitions and 4 journals on the
> > > > > > > same disk. The performance didn't increase significantly.
> > > > > > > Also, since we run read tests, the journals shouldn't cause
> > > > > > > performance issues.
> > > > > > >
> > > > > > > We then ran 4 fio processes in parallel on the same mounted
> > > > > > > rbd image and the total reached 100K IOPS. More parallel fio
> > > > > > > processes didn't increase the measured IOPS.
> > > > > > >
> > > > > > > Our ceph.conf is pretty basic (debug is set to 0/0 for
> > > > > > > everything) and the crushmap just defines the different
> > > > > > > buckets/rules for the disk separation (rotational, SSD, NVMe)
> > > > > > > in order to create the required pools.
> > > > > > >
> > > > > > > Is 100,000 IOPS for random 4K reads normal for a disk that
> > > > > > > runs at more than 300K IOPS on the same benchmark and the same
> > > > > > > hardware, or are we missing something?
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Kostas
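For reference, the degraded-state read test described above can be
approximated with something like the following. This is only a sketch: the
pool and image names, device path, queue depth and runtime are assumptions,
since the exact fio options and ceph.conf used are not shown in the thread.

  # noout keeps the down OSDs from being marked out while the second host
  # is offline; the image is then mapped locally on the powered-on OSD node
  # (pool and image names are made up for illustration)
  ceph osd set noout
  rbd map nvme-pool/bench-img        # shows up as /dev/rbdX

  # four parallel fio jobs doing random 4K reads against the mapped image
  fio --name=rand-4k-read --filename=/dev/rbd0 --rw=randread --bs=4k \
      --iodepth=32 --numjobs=4 --group_reporting --direct=1 \
      --ioengine=libaio --time_based --runtime=60

  # "debug 0/0 for everything" in ceph.conf refers to entries of the form
  # below (only a few of the many debug subsystems shown):
  #   [global]
  #   debug ms = 0/0
  #   debug osd = 0/0
  #   debug filestore = 0/0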