Would be nice to see your output of:

rados bench -p rbd 60 write --no-cleanup -t 56 -b 4096 -o 1M

Total time run:         60.005452
Total writes made:      438295
Write size:             4096
Object size:            1048576
Bandwidth (MB/sec):     28.5322
Stddev Bandwidth:       0.514721
Max bandwidth (MB/sec): 29.5781
Min bandwidth (MB/sec): 27.1328
Average IOPS:           7304
Stddev IOPS:            131
Max IOPS:               7572
Min IOPS:               6946
Average Latency(s):     0.00766615
Stddev Latency(s):      0.00276119
Max latency(s):         0.0481837
Min latency(s):         0.000474167

In real life, a block size of only 4096 bytes is not really common, I think :)

Regards

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing
+49 171 4802507

Am 26.10.2017 um 19:01 schrieb Maged Mokhtar:
>
> I wish the firmware update will fix things for you.
> Regarding monitoring: if your tool is able to record disk busy%, iops,
> throughput then you do not need to run atop.
>
> I still highly recommend you run the fio SSD test for sync writes:
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> The other important factor for SSDs is they should have commercial
> grade endurance/DWPD.
>
> In the absence of other load, if you stress your cluster using the rados
> 4k benchmark (I recommended 4k since this was the block size you were
> getting when doing the RAID comparison in your initial post), your load
> will be dominated by iops performance. You should easily be seeing a
> couple of thousand iops at the raw disk level; on a cluster level with
> 30 disks, you should be roughly approaching 30 x actual raw disk iops
> for 4k reads and about 5 x for writes (due to replicas and journal
> seeks). If you were using fast SSDs (10k+ iops per disk), you will
> start hitting other bottlenecks like cpu%, but your case is far from
> this. In your case, to get decent cluster iops performance you should
> be aiming for a couple of thousand iops at the raw disk level and a
> busy% of below 90% during the rados 4k test.
>
> Maged
>
> On 2017-10-26 16:44, Russell Glaue wrote:
>
>> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>>
>> It depends on what stage you are in:
>> in production, probably the best thing is to set up a monitoring
>> tool (collectd/graphite/prometheus/grafana) to monitor both ceph
>> stats as well as resource load. This will, among other things,
>> show you if you have slowing disks.
>>
>> I am monitoring Ceph performance with ceph-dash
>> (http://cephdash.crapworks.de/), that is why I knew to look into the
>> slow writes issue. And I am using Monitorix
>> (http://www.monitorix.org/) to monitor system resources, including
>> Disk I/O.
>>
>> However, though I can monitor individual disk performance at the
>> system level, it seems Ceph does not tax any disk more than the worst
>> disk. So in my monitoring charts, all disks have the same performance.
>> All four nodes are base-lining at 50 writes/sec during the cluster's
>> normal load, with the non-problem hosts spiking up to 150, and the
>> problem host only spiking up to 100.
>> But during the window of time I took the problem host OSDs down to
>> run the bench tests, the OSDs on the other nodes increased to 300-500
>> writes/sec. Otherwise, the chart looks the same for all disks on all
>> ceph nodes/hosts.
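
For anyone who wants to run the sync-write test from the blog post linked above, the fio invocation is roughly the following (a sketch from memory rather than a copy from the post; /dev/sdX is a placeholder, and fio writes straight to the raw device, so only point it at a disk whose data you can afford to lose):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

As a rough rule of thumb, SSDs that work well as Ceph journals sustain thousands of 4k sync-write iops in this test, while many consumer drives drop to a few hundred.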
>> >> >> Before production you should first make sure your SSDs are >> suitable for Ceph, either by being recommend by other Ceph users >> or you test them yourself for sync writes performance using fio >> tool as outlined earlier. Then after you build your cluster you >> can use rados and/or rbd bencmark tests to benchmark your cluster >> and find bottlenecks using atop/sar/collectl which will help you >> tune your cluster. >> >> All 36 OSDs are: Crucial_CT960M500SSD1 >> >> Rados bench tests were done at the beginning. The speed was much >> faster than it is now. I cannot recall the test results, someone else >> on my team ran them. Recently, I had thought the slow disk problem >> was a configuration issue with Ceph - before I posted here. Now we >> are hoping it may be resolved with a firmware update. (If it is >> firmware related, rebooting the problem node may temporarily resolve >> this) >> >> >> Though you did see better improvements, your cluster with 27 SSDs >> should give much higher numbers than 3k iops. If you are running >> rados bench while you have other client ios, then obviously the >> reported number by the tool will be less than what the cluster is >> actually giving...which you can find out via ceph status command, >> it will print the total cluster throughput and iops. If the total >> is still low i would recommend running the fio raw disk test, >> maybe the disks are not suitable. When you removed your 9 bad >> disk from 36 and your performance doubled, you still had 2 other >> disk slowing you..meaning near 100% busy ? It makes me feel the >> disk type used is not good. For these near 100% busy disks can >> you also measure their raw disk iops at that load (i am not sure >> atop shows this, if not use sat/syssyat/iostat/collecl). >> >> I ran another bench test today with all 36 OSDs up. The overall >> performance was improved slightly compared to the original tests. >> Only 3 OSDs on the problem host were increasing to 101% disk busy. >> The iops reported from ceph status during this bench test ranged from >> 1.6k to 3.3k, the test yielding 4k iops. >> >> Yes, the two other OSDs/disks that were the bottleneck were at 101% >> disk busy. The other OSD disks on the same host were sailing along at >> like 50-60% busy. >> >> All 36 OSD disks are exactly the same disk. They were all purchased >> at the same time. All were installed at the same time. >> I cannot believe it is a problem with the disk model. A failed/bad >> disk, perhaps is possible. But the disk model itself cannot be the >> problem based on what I am seeing. If I am seeing bad performance on >> all disks on one ceph node/host, but not on another ceph node with >> these same disks, it has to be some other factor. This is why I am >> now guessing a firmware upgrade is needed. >> >> Also, as I eluded to here earlier. I took down all 9 OSDs in the >> problem host yesterday to run the bench test. >> Today, with those 9 OSDs back online, I rerun the bench test, I am >> see 2-3 OSD disks with 101% busy on the problem host, and the other >> disks are lower than 80%. So, for whatever reason, shutting down the >> OSDs and starting them back up, allowed many (not all) of the OSDs >> performance to improve on the problem host. >> >> >> >> Maged >> >> On 2017-10-25 23:44, Russell Glaue wrote: >> >> Thanks to all. >> I took the OSDs down in the problem host, without shutting >> down the machine. >> As predicted, our MB/s about doubled. 
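
A lighter-weight way to hunt for lagging OSDs while a bench is running, without stopping anything, is to watch the per-OSD latencies alongside the cluster-wide rates (a sketch; the exact column names differ slightly between releases):

ceph osd perf          # per-OSD commit/apply latency in ms; a slow disk usually stands out
ceph -w                # or: watch -n1 ceph -s, for cluster-wide client throughput and iops

An OSD whose commit latency sits far above its peers during the 4k test is usually the same one atop shows near 100% busy.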
>> Using this bench/atop procedure, I found two other OSDs on >> another host that are the next bottlenecks. >> >> Is this the only good way to really test the performance of >> the drives as OSDs? Is there any other way? >> >> While running the bench on all 36 OSDs, the 9 problem OSDs >> stuck out. But two new problem OSDs I just discovered in this >> recent test of 27 OSDs did not stick out at all. Because ceph >> bench distributes the load making only the very worst >> denominators show up in atop. So ceph is a slow as your >> slowest drive. >> >> It would be really great if I could run the bench test, and >> some how get the bench to use only certain OSDs during the >> test. Then I could run the test, avoiding the OSDs that I >> already know is a problem, so I can find the next worst OSD. >> >> >> [ the bench test ] >> rados bench -p scbench -b 4096 30 write -t 32 >> >> [ original results with all 36 OSDs ] >> Total time run: 30.822350 >> Total writes made: 31032 >> Write size: 4096 >> Object size: 4096 >> Bandwidth (MB/sec): 3.93282 >> Stddev Bandwidth: 3.66265 >> Max bandwidth (MB/sec): 13.668 >> Min bandwidth (MB/sec): 0 >> Average IOPS: 1006 >> Stddev IOPS: 937 >> Max IOPS: 3499 >> Min IOPS: 0 >> Average Latency(s): 0.0317779 >> Stddev Latency(s): 0.164076 >> Max latency(s): 2.27707 >> Min latency(s): 0.0013848 >> Cleaning up (deleting benchmark objects) >> Clean up completed and total clean up time :20.166559 >> >> [ after stopping all of the OSDs (9) on the problem host ] >> Total time run: 32.586830 >> Total writes made: 59491 >> Write size: 4096 >> Object size: 4096 >> Bandwidth (MB/sec): 7.13131 >> Stddev Bandwidth: 9.78725 >> Max bandwidth (MB/sec): 29.168 >> Min bandwidth (MB/sec): 0 >> Average IOPS: 1825 >> Stddev IOPS: 2505 >> Max IOPS: 7467 >> Min IOPS: 0 >> Average Latency(s): 0.0173691 >> Stddev Latency(s): 0.21634 >> Max latency(s): 6.71283 >> Min latency(s): 0.00107473 >> Cleaning up (deleting benchmark objects) >> Clean up completed and total clean up time :16.269393 >> >> >> >> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue >> <rglaue@xxxxxxxx <mailto:rglaue@xxxxxxxx>> wrote: >> >> On the machine in question, the 2nd newest, we are using >> the LSI MegaRAID SAS-3 3008 [Fury], which allows us a >> "Non-RAID" option, and has no battery. The older two use >> the LSI MegaRAID SAS 2208 [Thunderbolt] I reported >> earlier, each single drive configured as RAID0. >> >> Thanks for everyone's help. >> I am going to run a 32 thread bench test after taking the >> 2nd machine out of the cluster with noout. >> After it is out of the cluster, I am expecting the slow >> write issue will not surface. >> >> >> On Fri, Oct 20, 2017 at 5:27 AM, David Turner >> <drakonstein@xxxxxxxxx <mailto:drakonstein@xxxxxxxxx>> wrote: >> >> I can attest that the battery in the raid controller >> is a thing. I'm used to using lsi controllers, but my >> current position has hp raid controllers and we just >> tracked down 10 of our nodes that had >100ms await >> pretty much always were the only 10 nodes in the >> cluster with failed batteries on the raid controllers. >> >> >> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer >> <chibi@xxxxxxx <mailto:chibi@xxxxxxx>> wrote: >> >> >> Hello, >> >> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue >> wrote: >> >> > That is a good idea. >> > However, a previous rebalancing processes has >> brought performance of our >> > Guest VMs to a slow drag. 
>> > >> >> Never mind that I'm not sure that these SSDs are >> particular well suited >> for Ceph, your problem is clearly located on that >> one node. >> >> Not that I think it's the case, but make sure >> your PG distribution is not >> skewed with many more PGs per OSD on that node. >> >> Once you rule that out my first guess is the RAID >> controller, you're >> running the SSDs are single RAID0s I presume? >> If so a either configuration difference or a >> failed BBU on the controller >> could result in the writeback cache being >> disabled, which would explain >> things beautifully. >> >> As for a temporary test/fix (with reduced >> redundancy of course), set noout >> (or mon_osd_down_out_subtree_limit accordingly) >> and turn the slow host off. >> >> This should result in much better performance >> than you have now and of >> course be the final confirmation of that host >> being the culprit. >> >> Christian >> >> > >> > On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles >> Lopez <jelopez@xxxxxxxxxx >> <mailto:jelopez@xxxxxxxxxx>> >> > wrote: >> > >> > > Hi Russell, >> > > >> > > as you have 4 servers, assuming you are not >> doing EC pools, just stop all >> > > the OSDs on the second questionable server, >> mark the OSDs on that server as >> > > out, let the cluster rebalance and when all >> PGs are active+clean just >> > > replay the test. >> > > >> > > All IOs should then go only to the other 3 >> servers. >> > > >> > > JC >> > > >> > > On Oct 19, 2017, at 13:49, Russell Glaue >> <rglaue@xxxxxxxx <mailto:rglaue@xxxxxxxx>> wrote: >> > > >> > > No, I have not ruled out the disk controller >> and backplane making the >> > > disks slower. >> > > Is there a way I could test that theory, >> other than swapping out hardware? >> > > -RG >> > > >> > > On Thu, Oct 19, 2017 at 3:44 PM, David Turner >> <drakonstein@xxxxxxxxx >> <mailto:drakonstein@xxxxxxxxx>> >> > > wrote: >> > > >> > >> Have you ruled out the disk controller and >> backplane in the server >> > >> running slower? >> > >> >> > >> On Thu, Oct 19, 2017 at 4:42 PM Russell >> Glaue <rglaue@xxxxxxxx <mailto:rglaue@xxxxxxxx>> >> wrote: >> > >> >> > >>> I ran the test on the Ceph pool, and ran >> atop on all 4 storage servers, >> > >>> as suggested. >> > >>> >> > >>> Out of the 4 servers: >> > >>> 3 of them performed with 17% to 30% disk >> %busy, and 11% CPU wait. >> > >>> Momentarily spiking up to 50% on one >> server, and 80% on another >> > >>> The 2nd newest server was almost averaging >> 90% disk %busy and 150% CPU >> > >>> wait. And more than momentarily spiking to >> 101% disk busy and 250% CPU wait. >> > >>> For this 2nd newest server, this was the >> statistics for about 8 of 9 >> > >>> disks, with the 9th disk not far behind the >> others. >> > >>> >> > >>> I cannot believe all 9 disks are bad >> > >>> They are the same disks as the newest 1st >> server, Crucial_CT960M500SSD1, >> > >>> and same exact server hardware too. >> > >>> They were purchased at the same time in the >> same purchase order and >> > >>> arrived at the same time. >> > >>> So I cannot believe I just happened to put >> 9 bad disks in one server, >> > >>> and 9 good ones in the other. >> > >>> >> > >>> I know I have Ceph configured exactly the >> same on all servers >> > >>> And I am sure I have the hardware settings >> configured exactly the same >> > >>> on the 1st and 2nd servers. >> > >>> So if I were someone else, I would say it >> maybe is bad hardware on the >> > >>> 2nd server. 
>> > >>> But the 2nd server is running very well >> without any hint of a problem. >> > >>> >> > >>> Any other ideas or suggestions? >> > >>> >> > >>> -RG >> > >>> >> > >>> >> > >>> On Wed, Oct 18, 2017 at 3:40 PM, Maged >> Mokhtar <mmokhtar@xxxxxxxxxxx >> <mailto:mmokhtar@xxxxxxxxxxx>> >> > >>> wrote: >> > >>> >> > >>>> just run the same 32 threaded rados test >> as you did before and this >> > >>>> time run atop while the test is running >> looking for %busy of cpu/disks. It >> > >>>> should give an idea if there is a >> bottleneck in them. >> > >>>> >> > >>>> On 2017-10-18 21:35, Russell Glaue wrote: >> > >>>> >> > >>>> I cannot run the write test reviewed at >> the ceph-how-to-test-if-your-s >> > >>>> sd-is-suitable-as-a-journal-device blog. >> The tests write directly to >> > >>>> the raw disk device. >> > >>>> Reading an infile (created with urandom) >> on one SSD, writing the >> > >>>> outfile to another osd, yields about 17MB/s. >> > >>>> But Isn't this write speed limited by the >> speed in which in the dd >> > >>>> infile can be read? >> > >>>> And I assume the best test should be run >> with no other load. >> > >>>> >> > >>>> How does one run the rados bench "as stress"? >> > >>>> >> > >>>> -RG >> > >>>> >> > >>>> >> > >>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged >> Mokhtar <mmokhtar@xxxxxxxxxxx >> <mailto:mmokhtar@xxxxxxxxxxx>> >> > >>>> wrote: >> > >>>> >> > >>>>> measuring resource load as outlined >> earlier will show if the drives >> > >>>>> are performing well or not. Also how many >> osds do you have ? >> > >>>>> >> > >>>>> On 2017-10-18 19:26, Russell Glaue wrote: >> > >>>>> >> > >>>>> The SSD drives are Crucial M500 >> > >>>>> A Ceph user did some benchmarks and found >> it had good performance >> > >>>>> >> https://forum.proxmox.com/threads/ceph-bad-performance-in- >> <https://forum.proxmox.com/threads/ceph-bad-performance-in-> >> > >>>>> qemu-guests.21551/ >> > >>>>> >> > >>>>> However, a user comment from 3 years ago >> on the blog post you linked >> > >>>>> to says to avoid the Crucial M500 >> > >>>>> >> > >>>>> Yet, this performance posting tells that >> the Crucial M500 is good. >> > >>>>> >> https://inside.servers.com/ssd-performance-2017-c4307a92dea >> <https://inside.servers.com/ssd-performance-2017-c4307a92dea> >> > >>>>> >> > >>>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged >> Mokhtar <mmokhtar@xxxxxxxxxxx >> <mailto:mmokhtar@xxxxxxxxxxx>> >> > >>>>> wrote: >> > >>>>> >> > >>>>>> Check out the following link: some SSDs >> perform bad in Ceph due to >> > >>>>>> sync writes to journal >> > >>>>>> >> > >>>>>> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes >> <https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes> >> > >>>>>> >> t-if-your-ssd-is-suitable-as-a-journal-device/ >> > >>>>>> >> > >>>>>> Anther thing that can help is to re-run >> the rados 32 threads as >> > >>>>>> stress and view resource usage using >> atop (or collectl/sar) to check for >> > >>>>>> %busy cpu and %busy disks to give you an >> idea of what is holding down your >> > >>>>>> cluster..for example: if cpu/disk % are >> all low then check your >> > >>>>>> network/switches. If disk %busy is high >> (90%) for all disks then your >> > >>>>>> disks are the bottleneck: which either >> means you have SSDs that are not >> > >>>>>> suitable for Ceph or you have too few >> disks (which i doubt is the case). If >> > >>>>>> only 1 disk %busy is high, there may be >> something wrong with this disk >> > >>>>>> should be removed. 
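
Another way to exercise one drive at a time through Ceph itself, without writing to the raw device, is the built-in per-OSD bench (a sketch; osd.12 is just an example id, and the run does put real write load on that OSD):

ceph tell osd.12 bench    # by default writes roughly 1 GB in 4 MB blocks and reports bytes_per_sec

Total bytes and block size can be appended as optional arguments, though some releases cap how much you may request for small block sizes. Running this against each OSD of the same model and comparing the reported bytes_per_sec is a quick way to single out a drive that is much slower than its siblings.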
>> > >>>>>> >> > >>>>>> Maged >> > >>>>>> >> > >>>>>> On 2017-10-18 18:13, Russell Glaue wrote: >> > >>>>>> >> > >>>>>> In my previous post, in one of my points >> I was wondering if the >> > >>>>>> request size would increase if I enabled >> jumbo packets. currently it is >> > >>>>>> disabled. >> > >>>>>> >> > >>>>>> @jdillama: The qemu settings for both >> these two guest machines, with >> > >>>>>> RAID/LVM and Ceph/rbd images, are the >> same. I am not thinking that changing >> > >>>>>> the qemu settings of >> "min_io_size=<limited to 16bits>,opt_io_size=<RBD >> > >>>>>> image object size>" will directly >> address the issue. >> > >>>>>> >> > >>>>>> @mmokhtar: Ok. So you suggest the >> request size is the result of the >> > >>>>>> problem and not the cause of the >> problem. meaning I should go after a >> > >>>>>> different issue. >> > >>>>>> >> > >>>>>> I have been trying to get write speeds >> up to what people on this mail >> > >>>>>> list are discussing. >> > >>>>>> It seems that for our configuration, as >> it matches others, we should >> > >>>>>> be getting about 70MB/s write speed. >> > >>>>>> But we are not getting that. >> > >>>>>> Single writes to disk are lucky to get >> 5MB/s to 6MB/s, but are >> > >>>>>> typically 1MB/s to 2MB/s. >> > >>>>>> Monitoring the entire Ceph cluster (using >> > >>>>>> http://cephdash.crapworks.de/), I have >> seen very rare momentary >> > >>>>>> spikes up to 30MB/s. >> > >>>>>> >> > >>>>>> My storage network is connected via a >> 10Gb switch >> > >>>>>> I have 4 storage servers with a LSI >> Logic MegaRAID SAS 2208 controller >> > >>>>>> Each storage server has 9 1TB SSD >> drives, each drive as 1 osd (no >> > >>>>>> RAID) >> > >>>>>> Each drive is one LVM group, with two >> volumes - one volume for the >> > >>>>>> osd, one volume for the journal >> > >>>>>> Each osd is formatted with xfs >> > >>>>>> The crush map is simple: >> default->rack->[host[1..4]->osd] with an >> > >>>>>> evenly distributed weight >> > >>>>>> The redundancy is triple replication >> > >>>>>> >> > >>>>>> While I have read comments that having >> the osd and journal on the >> > >>>>>> same disk decreases write speed, I have >> also read that once past 8 OSDs per >> > >>>>>> node this is the recommended >> configuration, however this is also the reason >> > >>>>>> why SSD drives are used exclusively for >> OSDs in the storage nodes. >> > >>>>>> None-the-less, I was still expecting >> write speeds to be above 30MB/s, >> > >>>>>> not below 6MB/s. >> > >>>>>> Even at 12x slower than the RAID, using >> my previously posted iostat >> > >>>>>> data set, I should be seeing write >> speeds that average 10MB/s, not 2MB/s. 
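
On the jumbo frame question raised above: before enabling it cluster-wide, it is worth confirming what the storage network actually passes end to end, roughly like this (a sketch; the interface name and target host are placeholders):

ip link show eth0 | grep mtu
ping -M do -s 8972 -c 3 <other-osd-host>    # 8972 = 9000 minus IP/ICMP headers; must pass without fragmenting

A mismatched MTU anywhere on the path hurts far more than staying at the standard 1500, so only switch after every node and switch port agrees.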
>> > >>>>>> >> > >>>>>> In regards to the rados benchmark tests >> you asked me to run, here is >> > >>>>>> the output: >> > >>>>>> >> > >>>>>> [centos7]# rados bench -p scbench -b >> 4096 30 write -t 1 >> > >>>>>> Maintaining 1 concurrent writes of 4096 >> bytes to objects of size 4096 >> > >>>>>> for up to 30 seconds or 0 objects >> > >>>>>> Object prefix: >> benchmark_data_hamms.sys.cu >> <http://benchmark_data_hamms.sys.cu>.cait.org_85049 >> > >>>>>> sec Cur ops started finished avg >> MB/s cur MB/s last lat(s) >> > >>>>>> avg lat(s) >> > >>>>>> 0 0 0 0 >> 0 0 - >> > >>>>>> 0 >> > >>>>>> 1 1 201 200 >> 0.78356 0.78125 0.00522307 >> > >>>>>> 0.00496574 >> > >>>>>> 2 1 469 468 >> 0.915303 1.04688 0.00437497 >> > >>>>>> 0.00426141 >> > >>>>>> 3 1 741 740 >> 0.964371 1.0625 0.00512853 >> > >>>>>> 0.0040434 >> > >>>>>> 4 1 888 887 >> 0.866739 0.574219 0.00307699 >> > >>>>>> 0.00450177 >> > >>>>>> 5 1 1147 1146 >> 0.895725 1.01172 0.00376454 >> > >>>>>> 0.0043559 >> > >>>>>> 6 1 1325 1324 >> 0.862293 0.695312 0.00459443 >> > >>>>>> 0.004525 >> > >>>>>> 7 1 1494 1493 >> 0.83339 0.660156 0.00461002 >> > >>>>>> 0.00458452 >> > >>>>>> 8 1 1736 1735 >> 0.847369 0.945312 0.00253971 >> > >>>>>> 0.00460458 >> > >>>>>> 9 1 1998 1997 >> 0.866922 1.02344 0.00236573 >> > >>>>>> 0.00450172 >> > >>>>>> 10 1 2260 2259 >> 0.882563 1.02344 0.00262179 >> > >>>>>> 0.00442152 >> > >>>>>> 11 1 2526 2525 >> 0.896775 1.03906 0.00336914 >> > >>>>>> 0.00435092 >> > >>>>>> 12 1 2760 2759 >> 0.898203 0.914062 0.00351827 >> > >>>>>> 0.00434491 >> > >>>>>> 13 1 3016 3015 >> 0.906025 1 0.00335703 >> > >>>>>> 0.00430691 >> > >>>>>> 14 1 3257 3256 >> 0.908545 0.941406 0.00332344 >> > >>>>>> 0.00429495 >> > >>>>>> 15 1 3490 3489 >> 0.908644 0.910156 0.00318815 >> > >>>>>> 0.00426387 >> > >>>>>> 16 1 3728 3727 >> 0.909952 0.929688 0.0032881 >> > >>>>>> 0.00428895 >> > >>>>>> 17 1 3986 3985 >> 0.915703 1.00781 0.00274809 >> > >>>>>> 0.0042614 >> > >>>>>> 18 1 4250 4249 >> 0.922116 1.03125 0.00287411 >> > >>>>>> 0.00423214 >> > >>>>>> 19 1 4505 4504 >> 0.926003 0.996094 0.00375435 >> > >>>>>> 0.00421442 >> > >>>>>> 2017-10-18 10:56:31.267173 min lat: >> 0.00181259 max lat: 0.270553 avg >> > >>>>>> lat: 0.00420118 >> > >>>>>> sec Cur ops started finished avg >> MB/s cur MB/s last lat(s) >> > >>>>>> avg lat(s) >> > >>>>>> 20 1 4757 4756 >> 0.928915 0.984375 0.00463972 >> > >>>>>> 0.00420118 >> > >>>>>> 21 1 5009 5008 >> 0.93155 0.984375 0.00360065 >> > >>>>>> 0.00418937 >> > >>>>>> 22 1 5235 5234 >> 0.929329 0.882812 0.00626214 >> > >>>>>> 0.004199 >> > >>>>>> 23 1 5500 5499 >> 0.933925 1.03516 0.00466584 >> > >>>>>> 0.00417836 >> > >>>>>> 24 1 5708 5707 >> 0.928861 0.8125 0.00285727 >> > >>>>>> 0.00420146 >> > >>>>>> 25 0 5964 5964 >> 0.931858 1.00391 0.00417383 >> > >>>>>> 0.0041881 >> > >>>>>> 26 1 6216 6215 >> 0.933722 0.980469 0.0041009 >> > >>>>>> 0.00417915 >> > >>>>>> 27 1 6481 6480 >> 0.937474 1.03516 0.00307484 >> > >>>>>> 0.00416118 >> > >>>>>> 28 1 6745 6744 >> 0.940819 1.03125 0.00266329 >> > >>>>>> 0.00414777 >> > >>>>>> 29 1 7003 7002 >> 0.943124 1.00781 0.00305905 >> > >>>>>> 0.00413758 >> > >>>>>> 30 1 7271 7270 >> 0.946578 1.04688 0.00391017 >> > >>>>>> 0.00412238 >> > >>>>>> Total time run: 30.006060 >> > >>>>>> Total writes made: 7272 >> > >>>>>> Write size: 4096 >> > >>>>>> Object size: 4096 >> > >>>>>> Bandwidth (MB/sec): 0.946684 >> > >>>>>> Stddev Bandwidth: 0.123762 >> > >>>>>> Max bandwidth (MB/sec): 1.0625 >> > >>>>>> Min bandwidth (MB/sec): 0.574219 >> > >>>>>> Average IOPS: 242 >> > >>>>>> Stddev 
IOPS: 31 >> > >>>>>> Max IOPS: 272 >> > >>>>>> Min IOPS: 147 >> > >>>>>> Average Latency(s): 0.00412247 >> > >>>>>> Stddev Latency(s): 0.00648437 >> > >>>>>> Max latency(s): 0.270553 >> > >>>>>> Min latency(s): 0.00175318 >> > >>>>>> Cleaning up (deleting benchmark objects) >> > >>>>>> Clean up completed and total clean up >> time :29.069423 >> > >>>>>> >> > >>>>>> [centos7]# rados bench -p scbench -b >> 4096 30 write -t 32 >> > >>>>>> Maintaining 32 concurrent writes of 4096 >> bytes to objects of size >> > >>>>>> 4096 for up to 30 seconds or 0 objects >> > >>>>>> Object prefix: >> benchmark_data_hamms.sys.cu >> <http://benchmark_data_hamms.sys.cu>.cait.org_86076 >> > >>>>>> sec Cur ops started finished avg >> MB/s cur MB/s last lat(s) >> > >>>>>> avg lat(s) >> > >>>>>> 0 0 0 0 >> 0 0 - >> > >>>>>> 0 >> > >>>>>> 1 32 3013 2981 >> 11.6438 11.6445 0.00247906 >> > >>>>>> 0.00572026 >> > >>>>>> 2 32 5349 5317 >> 10.3834 9.125 0.00246662 >> > >>>>>> 0.00932016 >> > >>>>>> 3 32 5707 5675 >> 7.3883 1.39844 0.00389774 >> > >>>>>> 0.0156726 >> > >>>>>> 4 32 5895 5863 >> 5.72481 0.734375 1.13137 >> > >>>>>> 0.0167946 >> > >>>>>> 5 32 6869 6837 >> 5.34068 3.80469 0.0027652 >> > >>>>>> 0.0226577 >> > >>>>>> 6 32 8901 8869 >> 5.77306 7.9375 0.0053211 >> > >>>>>> 0.0216259 >> > >>>>>> 7 32 10800 10768 >> 6.00785 7.41797 0.00358187 >> > >>>>>> 0.0207418 >> > >>>>>> 8 32 11825 11793 >> 5.75728 4.00391 0.00217575 >> > >>>>>> 0.0215494 >> > >>>>>> 9 32 12941 12909 >> 5.6019 4.35938 0.00278512 >> > >>>>>> 0.0220567 >> > >>>>>> 10 32 13317 13285 >> 5.18849 1.46875 0.0034973 >> > >>>>>> 0.0240665 >> > >>>>>> 11 32 16189 16157 >> 5.73653 11.2188 0.00255841 >> > >>>>>> 0.0212708 >> > >>>>>> 12 32 16749 16717 >> 5.44077 2.1875 0.00330334 >> > >>>>>> 0.0215915 >> > >>>>>> 13 32 16756 16724 >> 5.02436 0.0273438 0.00338994 >> > >>>>>> 0.021849 >> > >>>>>> 14 32 17908 17876 >> 4.98686 4.5 0.00402598 >> > >>>>>> 0.0244568 >> > >>>>>> 15 32 17936 17904 >> 4.66171 0.109375 0.00375799 >> > >>>>>> 0.0245545 >> > >>>>>> 16 32 18279 18247 >> 4.45409 1.33984 0.00483873 >> > >>>>>> 0.0267929 >> > >>>>>> 17 32 18372 18340 >> 4.21346 0.363281 0.00505187 >> > >>>>>> 0.0275887 >> > >>>>>> 18 32 19403 19371 >> 4.20309 4.02734 0.00545154 >> > >>>>>> 0.029348 >> > >>>>>> 19 31 19845 19814 >> 4.07295 1.73047 0.00254726 >> > >>>>>> 0.0306775 >> > >>>>>> 2017-10-18 10:57:58.160536 min lat: >> 0.0015005 max lat: 2.27707 avg >> > >>>>>> lat: 0.0307559 >> > >>>>>> sec Cur ops started finished avg >> MB/s cur MB/s last lat(s) >> > >>>>>> avg lat(s) >> > >>>>>> 20 31 20401 20370 >> 3.97788 2.17188 0.00307238 >> > >>>>>> 0.0307559 >> > >>>>>> 21 32 21338 21306 >> 3.96254 3.65625 0.00464563 >> > >>>>>> 0.0312288 >> > >>>>>> 22 32 23057 23025 >> 4.0876 6.71484 0.00296295 >> > >>>>>> 0.0299267 >> > >>>>>> 23 32 23057 23025 >> 3.90988 0 - >> > >>>>>> 0.0299267 >> > >>>>>> 24 32 23803 23771 >> 3.86837 1.45703 0.00301471 >> > >>>>>> 0.0312804 >> > >>>>>> 25 32 24112 24080 >> 3.76191 1.20703 0.00191063 >> > >>>>>> 0.0331462 >> > >>>>>> 26 31 25303 25272 >> 3.79629 4.65625 0.00794399 >> > >>>>>> 0.0329129 >> > >>>>>> 27 32 28803 28771 >> 4.16183 13.668 0.0109817 >> > >>>>>> 0.0297469 >> > >>>>>> 28 32 29592 29560 >> 4.12325 3.08203 0.00188185 >> > >>>>>> 0.0301911 >> > >>>>>> 29 32 30595 30563 >> 4.11616 3.91797 0.00379099 >> > >>>>>> 0.0296794 >> > >>>>>> 30 32 31031 30999 >> 4.03572 1.70312 0.00283347 >> > >>>>>> 0.0302411 >> > >>>>>> Total time run: 30.822350 >> > >>>>>> Total writes made: 31032 >> > >>>>>> Write size: 4096 >> > >>>>>> Object 
size: 4096 >> > >>>>>> Bandwidth (MB/sec): 3.93282 >> > >>>>>> Stddev Bandwidth: 3.66265 >> > >>>>>> Max bandwidth (MB/sec): 13.668 >> > >>>>>> Min bandwidth (MB/sec): 0 >> > >>>>>> Average IOPS: 1006 >> > >>>>>> Stddev IOPS: 937 >> > >>>>>> Max IOPS: 3499 >> > >>>>>> Min IOPS: 0 >> > >>>>>> Average Latency(s): 0.0317779 >> > >>>>>> Stddev Latency(s): 0.164076 >> > >>>>>> Max latency(s): 2.27707 >> > >>>>>> Min latency(s): 0.0013848 >> > >>>>>> Cleaning up (deleting benchmark objects) >> > >>>>>> Clean up completed and total clean up >> time :20.166559 >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> On Wed, Oct 18, 2017 at 8:51 AM, Maged >> Mokhtar <mmokhtar@xxxxxxxxxxx >> <mailto:mmokhtar@xxxxxxxxxxx>> >> > >>>>>> wrote: >> > >>>>>> >> > >>>>>>> First a general comment: local RAID >> will be faster than Ceph for a >> > >>>>>>> single threaded (queue depth=1) io >> operation test. A single thread Ceph >> > >>>>>>> client will see at best same disk speed >> for reads and for writes 4-6 times >> > >>>>>>> slower than single disk. Not to mention >> the latency of local disks will >> > >>>>>>> much better. Where Ceph shines is when >> you have many concurrent ios, it >> > >>>>>>> scales whereas RAID will decrease speed >> per client as you add more. >> > >>>>>>> >> > >>>>>>> Having said that, i would recommend >> running rados/rbd bench-write >> > >>>>>>> and measure 4k iops at 1 and 32 threads >> to get a better idea of how your >> > >>>>>>> cluster performs: >> > >>>>>>> >> > >>>>>>> ceph osd pool create testpool 256 256 >> > >>>>>>> rados bench -p testpool -b 4096 30 >> write -t 1 >> > >>>>>>> rados bench -p testpool -b 4096 30 >> write -t 32 >> > >>>>>>> ceph osd pool delete testpool testpool >> --yes-i-really-really-mean-it >> > >>>>>>> >> > >>>>>>> rbd bench-write test-image >> --io-threads=1 --io-size 4096 >> > >>>>>>> --io-pattern rand --rbd_cache=false >> > >>>>>>> rbd bench-write test-image >> --io-threads=32 --io-size 4096 >> > >>>>>>> --io-pattern rand --rbd_cache=false >> > >>>>>>> >> > >>>>>>> I think the request size difference you >> see is due to the io >> > >>>>>>> scheduler in the case of local disks >> having more ios to re-group so has a >> > >>>>>>> better chance in generating larger >> requests. Depending on your kernel, the >> > >>>>>>> io scheduler may be different for rbd >> (blq-mq) vs sdx (cfq) but again i >> > >>>>>>> would think the request size is a >> result not a cause. >> > >>>>>>> >> > >>>>>>> Maged >> > >>>>>>> >> > >>>>>>> On 2017-10-17 23:12, Russell Glaue wrote: >> > >>>>>>> >> > >>>>>>> I am running ceph jewel on 5 nodes with >> SSD OSDs. >> > >>>>>>> I have an LVM image on a local RAID of >> spinning disks. >> > >>>>>>> I have an RBD image on in a pool of SSD >> disks. >> > >>>>>>> Both disks are used to run an almost >> identical CentOS 7 system. >> > >>>>>>> Both systems were installed with the >> same kickstart, though the disk >> > >>>>>>> partitioning is different. >> > >>>>>>> >> > >>>>>>> I want to make writes on the the ceph >> image faster. For example, >> > >>>>>>> lots of writes to MySQL (via MySQL >> replication) on a ceph SSD image are >> > >>>>>>> about 10x slower than on a spindle RAID >> disk image. The MySQL server on >> > >>>>>>> ceph rbd image has a hard time keeping >> up in replication. >> > >>>>>>> >> > >>>>>>> So I wanted to test writes on these two >> systems >> > >>>>>>> I have a 10GB compressed (gzip) file on >> both servers. >> > >>>>>>> I simply gunzip the file on both >> systems, while running iostat. 
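
As an aside, gunzip mixes reads, decompression and writes, so a write-only probe inside each guest isolates the write path better; something like the following is enough for a rough comparison (a sketch, writing to a scratch file you can delete afterwards):

dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=2048 oflag=direct    # large sequential writes
dd if=/dev/zero of=/var/tmp/ddtest bs=4k count=20000 oflag=dsync    # small sync writes, closer to what MySQL replication generates

The second case is usually where an rbd-backed guest falls furthest behind a local RAID with a working write-back cache.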
>> > >>>>>>> >> > >>>>>>> The primary difference I see in the >> results is the average size of >> > >>>>>>> the request to the disk. >> > >>>>>>> CentOS7-lvm-raid-sata writes a lot >> faster to disk, and the size of >> > >>>>>>> the request is about 40x, but the >> number of writes per second is about the >> > >>>>>>> same >> > >>>>>>> This makes me want to conclude that the >> smaller size of the request >> > >>>>>>> for CentOS7-ceph-rbd-ssd system is the >> cause of it being slow. >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> How can I make the size of the request >> larger for ceph rbd images, >> > >>>>>>> so I can increase the write throughput? >> > >>>>>>> Would this be related to having jumbo >> packets enabled in my ceph >> > >>>>>>> storage network? >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> Here is a sample of the results: >> > >>>>>>> >> > >>>>>>> [CentOS7-lvm-raid-sata] >> > >>>>>>> $ gunzip large10gFile.gz & >> > >>>>>>> $ iostat -x vg_root-lv_var -d 5 -m -N >> > >>>>>>> Device: rrqm/s wrqm/s >> r/s w/s rMB/s wMB/s >> > >>>>>>> avgrq-sz avgqu-sz await r_await >> w_await svctm %util >> > >>>>>>> ... >> > >>>>>>> vg_root-lv_var 0.00 0.00 >> 30.60 452.20 13.60 222.15 >> > >>>>>>> 1000.04 8.69 14.05 0.99 >> 14.93 2.07 100.04 >> > >>>>>>> vg_root-lv_var 0.00 0.00 >> 88.20 182.00 39.20 89.43 >> > >>>>>>> 974.95 4.65 9.82 0.99 >> 14.10 3.70 100.00 >> > >>>>>>> vg_root-lv_var 0.00 0.00 >> 75.45 278.24 33.53 136.70 >> > >>>>>>> 985.73 4.36 33.26 1.34 >> 41.91 0.59 20.84 >> > >>>>>>> vg_root-lv_var 0.00 0.00 >> 111.60 181.80 49.60 89.34 >> > >>>>>>> 969.84 2.60 8.87 0.81 >> 13.81 0.13 3.90 >> > >>>>>>> vg_root-lv_var 0.00 0.00 >> 68.40 109.60 30.40 53.63 >> > >>>>>>> 966.87 1.51 8.46 0.84 >> 13.22 0.80 14.16 >> > >>>>>>> ... >> > >>>>>>> >> > >>>>>>> [CentOS7-ceph-rbd-ssd] >> > >>>>>>> $ gunzip large10gFile.gz & >> > >>>>>>> $ iostat -x vg_root-lv_data -d 5 -m -N >> > >>>>>>> Device: rrqm/s wrqm/s >> r/s w/s rMB/s wMB/s >> > >>>>>>> avgrq-sz avgqu-sz await r_await >> w_await svctm %util >> > >>>>>>> ... >> > >>>>>>> vg_root-lv_data 0.00 0.00 >> 46.40 167.80 0.88 1.46 >> > >>>>>>> 22.36 1.23 5.66 2.47 >> 6.54 4.52 96.82 >> > >>>>>>> vg_root-lv_data 0.00 0.00 >> 16.60 55.20 0.36 0.14 >> > >>>>>>> 14.44 0.99 13.91 9.12 >> 15.36 13.71 98.46 >> > >>>>>>> vg_root-lv_data 0.00 0.00 >> 69.00 173.80 1.34 1.32 >> > >>>>>>> 22.48 1.25 5.19 3.77 >> 5.75 3.94 95.68 >> > >>>>>>> vg_root-lv_data 0.00 0.00 >> 74.40 293.40 1.37 1.47 >> > >>>>>>> 15.83 1.22 3.31 2.06 >> 3.63 2.54 93.26 >> > >>>>>>> vg_root-lv_data 0.00 0.00 >> 90.80 359.00 1.96 3.41 >> > >>>>>>> 24.45 1.63 3.63 1.94 >> 4.05 2.10 94.38 >> > >>>>>>> ... >> > >>>>>>> >> > >>>>>>> [iostat key] >> > >>>>>>> w/s == The number (after merges) of >> write requests completed per >> > >>>>>>> second for the device. >> > >>>>>>> wMB/s == The number of sectors >> (kilobytes, megabytes) written to the >> > >>>>>>> device per second. >> > >>>>>>> avgrq-sz == The average size (in >> kilobytes) of the requests that >> > >>>>>>> were issued to the device. >> > >>>>>>> avgqu-sz == The average queue length of >> the requests that were >> > >>>>>>> issued to the device. 
>> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> _______________________________________________ >> > >>>>>>> ceph-users mailing list >> > >>>>>>> ceph-users@xxxxxxxxxxxxxx >> <mailto:ceph-users@xxxxxxxxxxxxxx> >> > >>>>>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > >>>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>>> >> > >>>>> >> > >>>>> >> > >>>>> >> > >>>> >> > >>>> >> > >>>> >> > >>> >> > >>> _______________________________________________ >> > >>> ceph-users mailing list >> > >>> ceph-users@xxxxxxxxxxxxxx >> <mailto:ceph-users@xxxxxxxxxxxxxx> >> > >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> > >>> >> > >> >> > > _______________________________________________ >> > > ceph-users mailing list >> > > ceph-users@xxxxxxxxxxxxxx >> <mailto:ceph-users@xxxxxxxxxxxxxx> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> > > >> > > >> > > >> >> >> -- >> Christian Balzer Network/Systems Engineer >> chibi@xxxxxxx <mailto:chibi@xxxxxxx> >> Rakuten Communications >> _______________________________________________ >> ceph-users mailing list >> ceph-users@xxxxxxxxxxxxxx >> <mailto:ceph-users@xxxxxxxxxxxxxx> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> >> >> _______________________________________________ >> ceph-users mailing list >> ceph-users@xxxxxxxxxxxxxx >> <mailto:ceph-users@xxxxxxxxxxxxxx> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> >> >> _______________________________________________ >> ceph-users mailing list >> ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com> >> >> >> >> >> > > > > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com