Hm, not necessarily directly related to your performance problem, however: these SSDs have a listed endurance of 72TB total data written - over a 5 year period that's about 40GB a day, or approx 0.04 DWPD for a 960GB drive. Given that you run the journal for each OSD on the same disk, that's effectively at most 0.02 DWPD (about 20GB per day per disk). I don't know many who'd run a cluster on disks like those. It also means these are pure consumer drives, which have a habit of exhibiting erratic performance at times (based on unquantified anecdotal personal experience with other consumer model SSDs). I wouldn't touch these with a long stick for anything but small toy-test clusters. On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rglaue@xxxxxxxx> wrote: > > On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote: >> >> It depends on what stage you are in: >> in production, probably the best thing is to set up a monitoring tool >> (collectd/graphite/prometheus/grafana) to monitor both ceph stats as well as >> resource load. This will, among other things, show you if you have slowing >> disks. > > I am monitoring Ceph performance with ceph-dash > (http://cephdash.crapworks.de/), which is why I knew to look into the slow > writes issue. And I am using Monitorix (http://www.monitorix.org/) to > monitor system resources, including Disk I/O. > > However, though I can monitor individual disk performance at the system > level, it seems Ceph does not tax any disk more than the worst disk. So in > my monitoring charts, all disks have the same performance. > All four nodes are base-lining at 50 writes/sec during the cluster's normal > load, with the non-problem hosts spiking up to 150, and the problem host > only spiking up to 100. > But during the window of time I took the problem host OSDs down to run the > bench tests, the OSDs on the other nodes increased to 300-500 writes/sec. > Otherwise, the chart looks the same for all disks on all ceph nodes/hosts. > >> Before production you should first make sure your SSDs are suitable for >> Ceph, either by being recommended by other Ceph users or by testing them >> yourself for sync write performance using the fio tool as outlined earlier. >> Then after you build your cluster you can use rados and/or rbd benchmark >> tests to benchmark your cluster and find bottlenecks using atop/sar/collectl, >> which will help you tune your cluster. > > All 36 OSDs are: Crucial_CT960M500SSD1 > > Rados bench tests were done at the beginning. The speed was much faster than > it is now. I cannot recall the test results; someone else on my team ran > them. Recently, I had thought the slow disk problem was a configuration > issue with Ceph - before I posted here. Now we are hoping it may be resolved > with a firmware update. (If it is firmware related, rebooting the problem > node may temporarily resolve this.) > >> >> Though you did see improvements, your cluster with 27 SSDs should >> give much higher numbers than 3k iops. If you are running rados bench while >> you have other client IOs, then obviously the number reported by the tool >> will be less than what the cluster is actually giving...which you can find >> out via the ceph status command; it will print the total cluster throughput and >> iops. If the total is still low i would recommend running the fio raw disk >> test, maybe the disks are not suitable. When you removed your 9 bad disks >> from 36 and your performance doubled, you still had 2 other disks slowing >> you... meaning near 100% busy?
It makes me feel the disk type used is not >> good. For these near 100% busy disks can you also measure their raw disk >> iops at that load (i am not sure atop shows this, if not use >> sar/sysstat/iostat/collectl). > > I ran another bench test today with all 36 OSDs up. The overall performance > was improved slightly compared to the original tests. Only 3 OSDs on the > problem host were reaching 101% disk busy. > The iops reported from ceph status during this bench test ranged from 1.6k > to 3.3k, with the test itself reporting 4k iops. > > Yes, the two other OSDs/disks that were the bottleneck were at 101% disk > busy. The other OSD disks on the same host were sailing along at like 50-60% > busy. > > All 36 OSD disks are exactly the same disk. They were all purchased at the > same time. All were installed at the same time. > I cannot believe it is a problem with the disk model. A failed/bad disk, > perhaps, is possible. But the disk model itself cannot be the problem based > on what I am seeing. If I am seeing bad performance on all disks on one ceph > node/host, but not on another ceph node with these same disks, it has to be > some other factor. This is why I am now guessing a firmware upgrade is > needed. > > Also, as I alluded to here earlier, I took down all 9 OSDs in the problem > host yesterday to run the bench test. > Today, with those 9 OSDs back online, I reran the bench test, and I am seeing 2-3 > OSD disks at 101% busy on the problem host, while the other disks are lower > than 80%. So, for whatever reason, shutting down the OSDs and starting them > back up allowed the performance of many (not all) of the OSDs on the > problem host to improve. > > >> Maged >> >> On 2017-10-25 23:44, Russell Glaue wrote: >> >> Thanks to all. >> I took the OSDs down in the problem host, without shutting down the >> machine. >> As predicted, our MB/s about doubled. >> Using this bench/atop procedure, I found two other OSDs on another host >> that are the next bottlenecks. >> >> Is this the only good way to really test the performance of the drives as >> OSDs? Is there any other way? >> >> While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But >> two new problem OSDs I just discovered in this recent test of 27 OSDs did >> not stick out at all. Because the ceph bench distributes the load, only >> the very worst performers show up in atop. So ceph is as slow as your >> slowest drive. >> >> It would be really great if I could run the bench test, and somehow get >> the bench to use only certain OSDs during the test. Then I could run the >> test, avoiding the OSDs that I already know are a problem, so I can find the >> next worst OSD.
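One way to get at that, for what it's worth: you can ask each OSD to benchmark its own data path with "ceph tell", which takes the CRUSH distribution out of the picture so a slow drive shows up on its own instead of just dragging the pool average down. A rough sketch (osd ids and sizes are placeholders; it does add write load, so best done in a quiet window):

# default: writes 1GB in 4MB chunks through one OSD and reports the rate
ceph tell osd.0 bench
# 4KB-write variant; total kept small because small-block benches are capped by default
ceph tell osd.0 bench 12288000 4096
# loop over every OSD and compare the per-OSD numbers
for i in $(ceph osd ls); do echo -n "osd.$i: "; ceph tell osd.$i bench 12288000 4096; done
# while at it, rule out PG-count skew toward the problem host
ceph osd df tree

The slowest one or two results should point at the next-worst drives without having to take anything else offline.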
>> >> >> [ the bench test ] >> rados bench -p scbench -b 4096 30 write -t 32 >> >> [ original results with all 36 OSDs ] >> Total time run: 30.822350 >> Total writes made: 31032 >> Write size: 4096 >> Object size: 4096 >> Bandwidth (MB/sec): 3.93282 >> Stddev Bandwidth: 3.66265 >> Max bandwidth (MB/sec): 13.668 >> Min bandwidth (MB/sec): 0 >> Average IOPS: 1006 >> Stddev IOPS: 937 >> Max IOPS: 3499 >> Min IOPS: 0 >> Average Latency(s): 0.0317779 >> Stddev Latency(s): 0.164076 >> Max latency(s): 2.27707 >> Min latency(s): 0.0013848 >> Cleaning up (deleting benchmark objects) >> Clean up completed and total clean up time :20.166559 >> >> [ after stopping all of the OSDs (9) on the problem host ] >> Total time run: 32.586830 >> Total writes made: 59491 >> Write size: 4096 >> Object size: 4096 >> Bandwidth (MB/sec): 7.13131 >> Stddev Bandwidth: 9.78725 >> Max bandwidth (MB/sec): 29.168 >> Min bandwidth (MB/sec): 0 >> Average IOPS: 1825 >> Stddev IOPS: 2505 >> Max IOPS: 7467 >> Min IOPS: 0 >> Average Latency(s): 0.0173691 >> Stddev Latency(s): 0.21634 >> Max latency(s): 6.71283 >> Min latency(s): 0.00107473 >> Cleaning up (deleting benchmark objects) >> Clean up completed and total clean up time :16.269393 >> >> >> >> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rglaue@xxxxxxxx> wrote: >>> >>> On the machine in question, the 2nd newest, we are using the LSI MegaRAID >>> SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery. >>> The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported >>> earlier, each single drive configured as RAID0. >>> >>> Thanks for everyone's help. >>> I am going to run a 32 thread bench test after taking the 2nd machine out >>> of the cluster with noout. >>> After it is out of the cluster, I am expecting the slow write issue will >>> not surface. >>> >>> >>> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonstein@xxxxxxxxx> >>> wrote: >>>> >>>> I can attest that the battery in the raid controller is a thing. I'm >>>> used to using lsi controllers, but my current position has hp raid >>>> controllers and we just tracked down 10 of our nodes that had >100ms await >>>> pretty much always were the only 10 nodes in the cluster with failed >>>> batteries on the raid controllers. >>>> >>>> >>>> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <chibi@xxxxxxx> wrote: >>>>> >>>>> >>>>> Hello, >>>>> >>>>> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote: >>>>> >>>>> > That is a good idea. >>>>> > However, a previous rebalancing processes has brought performance of >>>>> > our >>>>> > Guest VMs to a slow drag. >>>>> > >>>>> >>>>> Never mind that I'm not sure that these SSDs are particular well suited >>>>> for Ceph, your problem is clearly located on that one node. >>>>> >>>>> Not that I think it's the case, but make sure your PG distribution is >>>>> not >>>>> skewed with many more PGs per OSD on that node. >>>>> >>>>> Once you rule that out my first guess is the RAID controller, you're >>>>> running the SSDs are single RAID0s I presume? >>>>> If so a either configuration difference or a failed BBU on the >>>>> controller >>>>> could result in the writeback cache being disabled, which would explain >>>>> things beautifully. >>>>> >>>>> As for a temporary test/fix (with reduced redundancy of course), set >>>>> noout >>>>> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host >>>>> off. 
>>>>> >>>>> This should result in much better performance than you have now and of >>>>> course be the final confirmation of that host being the culprit. >>>>> >>>>> Christian >>>>> >>>>> > >>>>> > On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez >>>>> > <jelopez@xxxxxxxxxx> >>>>> > wrote: >>>>> > >>>>> > > Hi Russell, >>>>> > > >>>>> > > as you have 4 servers, assuming you are not doing EC pools, just >>>>> > > stop all >>>>> > > the OSDs on the second questionable server, mark the OSDs on that >>>>> > > server as >>>>> > > out, let the cluster rebalance and when all PGs are active+clean >>>>> > > just >>>>> > > replay the test. >>>>> > > >>>>> > > All IOs should then go only to the other 3 servers. >>>>> > > >>>>> > > JC >>>>> > > >>>>> > > On Oct 19, 2017, at 13:49, Russell Glaue <rglaue@xxxxxxxx> wrote: >>>>> > > >>>>> > > No, I have not ruled out the disk controller and backplane making >>>>> > > the >>>>> > > disks slower. >>>>> > > Is there a way I could test that theory, other than swapping out >>>>> > > hardware? >>>>> > > -RG >>>>> > > >>>>> > > On Thu, Oct 19, 2017 at 3:44 PM, David Turner >>>>> > > <drakonstein@xxxxxxxxx> >>>>> > > wrote: >>>>> > > >>>>> > >> Have you ruled out the disk controller and backplane in the server >>>>> > >> running slower? >>>>> > >> >>>>> > >> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rglaue@xxxxxxxx> >>>>> > >> wrote: >>>>> > >> >>>>> > >>> I ran the test on the Ceph pool, and ran atop on all 4 storage >>>>> > >>> servers, >>>>> > >>> as suggested. >>>>> > >>> >>>>> > >>> Out of the 4 servers: >>>>> > >>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait. >>>>> > >>> Momentarily spiking up to 50% on one server, and 80% on another >>>>> > >>> The 2nd newest server was almost averaging 90% disk %busy and >>>>> > >>> 150% CPU >>>>> > >>> wait. And more than momentarily spiking to 101% disk busy and >>>>> > >>> 250% CPU wait. >>>>> > >>> For this 2nd newest server, this was the statistics for about 8 >>>>> > >>> of 9 >>>>> > >>> disks, with the 9th disk not far behind the others. >>>>> > >>> >>>>> > >>> I cannot believe all 9 disks are bad >>>>> > >>> They are the same disks as the newest 1st server, >>>>> > >>> Crucial_CT960M500SSD1, >>>>> > >>> and same exact server hardware too. >>>>> > >>> They were purchased at the same time in the same purchase order >>>>> > >>> and >>>>> > >>> arrived at the same time. >>>>> > >>> So I cannot believe I just happened to put 9 bad disks in one >>>>> > >>> server, >>>>> > >>> and 9 good ones in the other. >>>>> > >>> >>>>> > >>> I know I have Ceph configured exactly the same on all servers >>>>> > >>> And I am sure I have the hardware settings configured exactly the >>>>> > >>> same >>>>> > >>> on the 1st and 2nd servers. >>>>> > >>> So if I were someone else, I would say it maybe is bad hardware >>>>> > >>> on the >>>>> > >>> 2nd server. >>>>> > >>> But the 2nd server is running very well without any hint of a >>>>> > >>> problem. >>>>> > >>> >>>>> > >>> Any other ideas or suggestions? >>>>> > >>> >>>>> > >>> -RG >>>>> > >>> >>>>> > >>> >>>>> > >>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar >>>>> > >>> <mmokhtar@xxxxxxxxxxx> >>>>> > >>> wrote: >>>>> > >>> >>>>> > >>>> just run the same 32 threaded rados test as you did before and >>>>> > >>>> this >>>>> > >>>> time run atop while the test is running looking for %busy of >>>>> > >>>> cpu/disks. It >>>>> > >>>> should give an idea if there is a bottleneck in them. 
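A quick inline note on the controller angle raised above: before pinning this on the drives, it is cheap to confirm what the RAID cache is actually doing on each node. On the 2208 nodes (single-drive RAID0) something along these lines with the MegaRAID CLI should show it - treat it as a sketch, as the binary name/path varies (MegaCli, MegaCli64, storcli):

# per-logical-drive cache policy: WriteBack vs WriteThrough
MegaCli64 -LDGetProp -Cache -LAll -aAll
# BBU state; a failed or absent battery typically forces the controller into WriteThrough
MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll

On the 3008 node running the drives in Non-RAID/pass-through mode there is no controller write cache in play at all, which is worth keeping in mind when comparing hosts.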
>>>>> > >>>> >>>>> > >>>> On 2017-10-18 21:35, Russell Glaue wrote: >>>>> > >>>> >>>>> > >>>> I cannot run the write test reviewed at the >>>>> > >>>> ceph-how-to-test-if-your-s >>>>> > >>>> sd-is-suitable-as-a-journal-device blog. The tests write >>>>> > >>>> directly to >>>>> > >>>> the raw disk device. >>>>> > >>>> Reading an infile (created with urandom) on one SSD, writing the >>>>> > >>>> outfile to another osd, yields about 17MB/s. >>>>> > >>>> But Isn't this write speed limited by the speed in which in the >>>>> > >>>> dd >>>>> > >>>> infile can be read? >>>>> > >>>> And I assume the best test should be run with no other load. >>>>> > >>>> >>>>> > >>>> How does one run the rados bench "as stress"? >>>>> > >>>> >>>>> > >>>> -RG >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar >>>>> > >>>> <mmokhtar@xxxxxxxxxxx> >>>>> > >>>> wrote: >>>>> > >>>> >>>>> > >>>>> measuring resource load as outlined earlier will show if the >>>>> > >>>>> drives >>>>> > >>>>> are performing well or not. Also how many osds do you have ? >>>>> > >>>>> >>>>> > >>>>> On 2017-10-18 19:26, Russell Glaue wrote: >>>>> > >>>>> >>>>> > >>>>> The SSD drives are Crucial M500 >>>>> > >>>>> A Ceph user did some benchmarks and found it had good >>>>> > >>>>> performance >>>>> > >>>>> https://forum.proxmox.com/threads/ceph-bad-performance-in- >>>>> > >>>>> qemu-guests.21551/ >>>>> > >>>>> >>>>> > >>>>> However, a user comment from 3 years ago on the blog post you >>>>> > >>>>> linked >>>>> > >>>>> to says to avoid the Crucial M500 >>>>> > >>>>> >>>>> > >>>>> Yet, this performance posting tells that the Crucial M500 is >>>>> > >>>>> good. >>>>> > >>>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea >>>>> > >>>>> >>>>> > >>>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar >>>>> > >>>>> <mmokhtar@xxxxxxxxxxx> >>>>> > >>>>> wrote: >>>>> > >>>>> >>>>> > >>>>>> Check out the following link: some SSDs perform bad in Ceph >>>>> > >>>>>> due to >>>>> > >>>>>> sync writes to journal >>>>> > >>>>>> >>>>> > >>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes >>>>> > >>>>>> t-if-your-ssd-is-suitable-as-a-journal-device/ >>>>> > >>>>>> >>>>> > >>>>>> Anther thing that can help is to re-run the rados 32 threads >>>>> > >>>>>> as >>>>> > >>>>>> stress and view resource usage using atop (or collectl/sar) to >>>>> > >>>>>> check for >>>>> > >>>>>> %busy cpu and %busy disks to give you an idea of what is >>>>> > >>>>>> holding down your >>>>> > >>>>>> cluster..for example: if cpu/disk % are all low then check >>>>> > >>>>>> your >>>>> > >>>>>> network/switches. If disk %busy is high (90%) for all disks >>>>> > >>>>>> then your >>>>> > >>>>>> disks are the bottleneck: which either means you have SSDs >>>>> > >>>>>> that are not >>>>> > >>>>>> suitable for Ceph or you have too few disks (which i doubt is >>>>> > >>>>>> the case). If >>>>> > >>>>>> only 1 disk %busy is high, there may be something wrong with >>>>> > >>>>>> this disk >>>>> > >>>>>> should be removed. >>>>> > >>>>>> >>>>> > >>>>>> Maged >>>>> > >>>>>> >>>>> > >>>>>> On 2017-10-18 18:13, Russell Glaue wrote: >>>>> > >>>>>> >>>>> > >>>>>> In my previous post, in one of my points I was wondering if >>>>> > >>>>>> the >>>>> > >>>>>> request size would increase if I enabled jumbo packets. >>>>> > >>>>>> currently it is >>>>> > >>>>>> disabled. >>>>> > >>>>>> >>>>> > >>>>>> @jdillama: The qemu settings for both these two guest >>>>> > >>>>>> machines, with >>>>> > >>>>>> RAID/LVM and Ceph/rbd images, are the same. 
I am not thinking >>>>> > >>>>>> that changing >>>>> > >>>>>> the qemu settings of "min_io_size=<limited to >>>>> > >>>>>> 16bits>,opt_io_size=<RBD >>>>> > >>>>>> image object size>" will directly address the issue. >>>>> > >>>>>> >>>>> > >>>>>> @mmokhtar: Ok. So you suggest the request size is the result >>>>> > >>>>>> of the >>>>> > >>>>>> problem and not the cause of the problem. meaning I should go >>>>> > >>>>>> after a >>>>> > >>>>>> different issue. >>>>> > >>>>>> >>>>> > >>>>>> I have been trying to get write speeds up to what people on >>>>> > >>>>>> this mail >>>>> > >>>>>> list are discussing. >>>>> > >>>>>> It seems that for our configuration, as it matches others, we >>>>> > >>>>>> should >>>>> > >>>>>> be getting about 70MB/s write speed. >>>>> > >>>>>> But we are not getting that. >>>>> > >>>>>> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are >>>>> > >>>>>> typically 1MB/s to 2MB/s. >>>>> > >>>>>> Monitoring the entire Ceph cluster (using >>>>> > >>>>>> http://cephdash.crapworks.de/), I have seen very rare >>>>> > >>>>>> momentary >>>>> > >>>>>> spikes up to 30MB/s. >>>>> > >>>>>> >>>>> > >>>>>> My storage network is connected via a 10Gb switch >>>>> > >>>>>> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 >>>>> > >>>>>> controller >>>>> > >>>>>> Each storage server has 9 1TB SSD drives, each drive as 1 osd >>>>> > >>>>>> (no >>>>> > >>>>>> RAID) >>>>> > >>>>>> Each drive is one LVM group, with two volumes - one volume for >>>>> > >>>>>> the >>>>> > >>>>>> osd, one volume for the journal >>>>> > >>>>>> Each osd is formatted with xfs >>>>> > >>>>>> The crush map is simple: default->rack->[host[1..4]->osd] with >>>>> > >>>>>> an >>>>> > >>>>>> evenly distributed weight >>>>> > >>>>>> The redundancy is triple replication >>>>> > >>>>>> >>>>> > >>>>>> While I have read comments that having the osd and journal on >>>>> > >>>>>> the >>>>> > >>>>>> same disk decreases write speed, I have also read that once >>>>> > >>>>>> past 8 OSDs per >>>>> > >>>>>> node this is the recommended configuration, however this is >>>>> > >>>>>> also the reason >>>>> > >>>>>> why SSD drives are used exclusively for OSDs in the storage >>>>> > >>>>>> nodes. >>>>> > >>>>>> None-the-less, I was still expecting write speeds to be above >>>>> > >>>>>> 30MB/s, >>>>> > >>>>>> not below 6MB/s. >>>>> > >>>>>> Even at 12x slower than the RAID, using my previously posted >>>>> > >>>>>> iostat >>>>> > >>>>>> data set, I should be seeing write speeds that average 10MB/s, >>>>> > >>>>>> not 2MB/s. 
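Worth adding here: with the journal sharing each SSD, it is the drives' O_DSYNC/journal write behaviour that bounds the whole write path, and that is exactly what the fio test from the blog linked further up measures. A rough sketch of that test (device name is a placeholder; pointing it at a raw device destroys data, so use a spare partition, or add --size and point it at a scratch file on one of the SSDs):

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

Journal-friendly SSDs tend to sustain thousands of 4k sync IOPS in this test; many consumer models drop to a few hundred or less, which would line up with the write speeds described above.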
>>>>> > >>>>>> >>>>> > >>>>>> In regards to the rados benchmark tests you asked me to run, >>>>> > >>>>>> here is >>>>> > >>>>>> the output: >>>>> > >>>>>> >>>>> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1 >>>>> > >>>>>> Maintaining 1 concurrent writes of 4096 bytes to objects of >>>>> > >>>>>> size 4096 >>>>> > >>>>>> for up to 30 seconds or 0 objects >>>>> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049 >>>>> > >>>>>> sec Cur ops started finished avg MB/s cur MB/s last >>>>> > >>>>>> lat(s) >>>>> > >>>>>> avg lat(s) >>>>> > >>>>>> 0 0 0 0 0 0 >>>>> > >>>>>> - >>>>> > >>>>>> 0 >>>>> > >>>>>> 1 1 201 200 0.78356 0.78125 >>>>> > >>>>>> 0.00522307 >>>>> > >>>>>> 0.00496574 >>>>> > >>>>>> 2 1 469 468 0.915303 1.04688 >>>>> > >>>>>> 0.00437497 >>>>> > >>>>>> 0.00426141 >>>>> > >>>>>> 3 1 741 740 0.964371 1.0625 >>>>> > >>>>>> 0.00512853 >>>>> > >>>>>> 0.0040434 >>>>> > >>>>>> 4 1 888 887 0.866739 0.574219 >>>>> > >>>>>> 0.00307699 >>>>> > >>>>>> 0.00450177 >>>>> > >>>>>> 5 1 1147 1146 0.895725 1.01172 >>>>> > >>>>>> 0.00376454 >>>>> > >>>>>> 0.0043559 >>>>> > >>>>>> 6 1 1325 1324 0.862293 0.695312 >>>>> > >>>>>> 0.00459443 >>>>> > >>>>>> 0.004525 >>>>> > >>>>>> 7 1 1494 1493 0.83339 0.660156 >>>>> > >>>>>> 0.00461002 >>>>> > >>>>>> 0.00458452 >>>>> > >>>>>> 8 1 1736 1735 0.847369 0.945312 >>>>> > >>>>>> 0.00253971 >>>>> > >>>>>> 0.00460458 >>>>> > >>>>>> 9 1 1998 1997 0.866922 1.02344 >>>>> > >>>>>> 0.00236573 >>>>> > >>>>>> 0.00450172 >>>>> > >>>>>> 10 1 2260 2259 0.882563 1.02344 >>>>> > >>>>>> 0.00262179 >>>>> > >>>>>> 0.00442152 >>>>> > >>>>>> 11 1 2526 2525 0.896775 1.03906 >>>>> > >>>>>> 0.00336914 >>>>> > >>>>>> 0.00435092 >>>>> > >>>>>> 12 1 2760 2759 0.898203 0.914062 >>>>> > >>>>>> 0.00351827 >>>>> > >>>>>> 0.00434491 >>>>> > >>>>>> 13 1 3016 3015 0.906025 1 >>>>> > >>>>>> 0.00335703 >>>>> > >>>>>> 0.00430691 >>>>> > >>>>>> 14 1 3257 3256 0.908545 0.941406 >>>>> > >>>>>> 0.00332344 >>>>> > >>>>>> 0.00429495 >>>>> > >>>>>> 15 1 3490 3489 0.908644 0.910156 >>>>> > >>>>>> 0.00318815 >>>>> > >>>>>> 0.00426387 >>>>> > >>>>>> 16 1 3728 3727 0.909952 0.929688 >>>>> > >>>>>> 0.0032881 >>>>> > >>>>>> 0.00428895 >>>>> > >>>>>> 17 1 3986 3985 0.915703 1.00781 >>>>> > >>>>>> 0.00274809 >>>>> > >>>>>> 0.0042614 >>>>> > >>>>>> 18 1 4250 4249 0.922116 1.03125 >>>>> > >>>>>> 0.00287411 >>>>> > >>>>>> 0.00423214 >>>>> > >>>>>> 19 1 4505 4504 0.926003 0.996094 >>>>> > >>>>>> 0.00375435 >>>>> > >>>>>> 0.00421442 >>>>> > >>>>>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: >>>>> > >>>>>> 0.270553 avg >>>>> > >>>>>> lat: 0.00420118 >>>>> > >>>>>> sec Cur ops started finished avg MB/s cur MB/s last >>>>> > >>>>>> lat(s) >>>>> > >>>>>> avg lat(s) >>>>> > >>>>>> 20 1 4757 4756 0.928915 0.984375 >>>>> > >>>>>> 0.00463972 >>>>> > >>>>>> 0.00420118 >>>>> > >>>>>> 21 1 5009 5008 0.93155 0.984375 >>>>> > >>>>>> 0.00360065 >>>>> > >>>>>> 0.00418937 >>>>> > >>>>>> 22 1 5235 5234 0.929329 0.882812 >>>>> > >>>>>> 0.00626214 >>>>> > >>>>>> 0.004199 >>>>> > >>>>>> 23 1 5500 5499 0.933925 1.03516 >>>>> > >>>>>> 0.00466584 >>>>> > >>>>>> 0.00417836 >>>>> > >>>>>> 24 1 5708 5707 0.928861 0.8125 >>>>> > >>>>>> 0.00285727 >>>>> > >>>>>> 0.00420146 >>>>> > >>>>>> 25 0 5964 5964 0.931858 1.00391 >>>>> > >>>>>> 0.00417383 >>>>> > >>>>>> 0.0041881 >>>>> > >>>>>> 26 1 6216 6215 0.933722 0.980469 >>>>> > >>>>>> 0.0041009 >>>>> > >>>>>> 0.00417915 >>>>> > >>>>>> 27 1 6481 6480 0.937474 1.03516 >>>>> > >>>>>> 0.00307484 >>>>> > >>>>>> 0.00416118 >>>>> > >>>>>> 28 1 6745 6744 
0.940819 1.03125 >>>>> > >>>>>> 0.00266329 >>>>> > >>>>>> 0.00414777 >>>>> > >>>>>> 29 1 7003 7002 0.943124 1.00781 >>>>> > >>>>>> 0.00305905 >>>>> > >>>>>> 0.00413758 >>>>> > >>>>>> 30 1 7271 7270 0.946578 1.04688 >>>>> > >>>>>> 0.00391017 >>>>> > >>>>>> 0.00412238 >>>>> > >>>>>> Total time run: 30.006060 >>>>> > >>>>>> Total writes made: 7272 >>>>> > >>>>>> Write size: 4096 >>>>> > >>>>>> Object size: 4096 >>>>> > >>>>>> Bandwidth (MB/sec): 0.946684 >>>>> > >>>>>> Stddev Bandwidth: 0.123762 >>>>> > >>>>>> Max bandwidth (MB/sec): 1.0625 >>>>> > >>>>>> Min bandwidth (MB/sec): 0.574219 >>>>> > >>>>>> Average IOPS: 242 >>>>> > >>>>>> Stddev IOPS: 31 >>>>> > >>>>>> Max IOPS: 272 >>>>> > >>>>>> Min IOPS: 147 >>>>> > >>>>>> Average Latency(s): 0.00412247 >>>>> > >>>>>> Stddev Latency(s): 0.00648437 >>>>> > >>>>>> Max latency(s): 0.270553 >>>>> > >>>>>> Min latency(s): 0.00175318 >>>>> > >>>>>> Cleaning up (deleting benchmark objects) >>>>> > >>>>>> Clean up completed and total clean up time :29.069423 >>>>> > >>>>>> >>>>> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32 >>>>> > >>>>>> Maintaining 32 concurrent writes of 4096 bytes to objects of >>>>> > >>>>>> size >>>>> > >>>>>> 4096 for up to 30 seconds or 0 objects >>>>> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076 >>>>> > >>>>>> sec Cur ops started finished avg MB/s cur MB/s last >>>>> > >>>>>> lat(s) >>>>> > >>>>>> avg lat(s) >>>>> > >>>>>> 0 0 0 0 0 0 >>>>> > >>>>>> - >>>>> > >>>>>> 0 >>>>> > >>>>>> 1 32 3013 2981 11.6438 11.6445 >>>>> > >>>>>> 0.00247906 >>>>> > >>>>>> 0.00572026 >>>>> > >>>>>> 2 32 5349 5317 10.3834 9.125 >>>>> > >>>>>> 0.00246662 >>>>> > >>>>>> 0.00932016 >>>>> > >>>>>> 3 32 5707 5675 7.3883 1.39844 >>>>> > >>>>>> 0.00389774 >>>>> > >>>>>> 0.0156726 >>>>> > >>>>>> 4 32 5895 5863 5.72481 0.734375 >>>>> > >>>>>> 1.13137 >>>>> > >>>>>> 0.0167946 >>>>> > >>>>>> 5 32 6869 6837 5.34068 3.80469 >>>>> > >>>>>> 0.0027652 >>>>> > >>>>>> 0.0226577 >>>>> > >>>>>> 6 32 8901 8869 5.77306 7.9375 >>>>> > >>>>>> 0.0053211 >>>>> > >>>>>> 0.0216259 >>>>> > >>>>>> 7 32 10800 10768 6.00785 7.41797 >>>>> > >>>>>> 0.00358187 >>>>> > >>>>>> 0.0207418 >>>>> > >>>>>> 8 32 11825 11793 5.75728 4.00391 >>>>> > >>>>>> 0.00217575 >>>>> > >>>>>> 0.0215494 >>>>> > >>>>>> 9 32 12941 12909 5.6019 4.35938 >>>>> > >>>>>> 0.00278512 >>>>> > >>>>>> 0.0220567 >>>>> > >>>>>> 10 32 13317 13285 5.18849 1.46875 >>>>> > >>>>>> 0.0034973 >>>>> > >>>>>> 0.0240665 >>>>> > >>>>>> 11 32 16189 16157 5.73653 11.2188 >>>>> > >>>>>> 0.00255841 >>>>> > >>>>>> 0.0212708 >>>>> > >>>>>> 12 32 16749 16717 5.44077 2.1875 >>>>> > >>>>>> 0.00330334 >>>>> > >>>>>> 0.0215915 >>>>> > >>>>>> 13 32 16756 16724 5.02436 0.0273438 >>>>> > >>>>>> 0.00338994 >>>>> > >>>>>> 0.021849 >>>>> > >>>>>> 14 32 17908 17876 4.98686 4.5 >>>>> > >>>>>> 0.00402598 >>>>> > >>>>>> 0.0244568 >>>>> > >>>>>> 15 32 17936 17904 4.66171 0.109375 >>>>> > >>>>>> 0.00375799 >>>>> > >>>>>> 0.0245545 >>>>> > >>>>>> 16 32 18279 18247 4.45409 1.33984 >>>>> > >>>>>> 0.00483873 >>>>> > >>>>>> 0.0267929 >>>>> > >>>>>> 17 32 18372 18340 4.21346 0.363281 >>>>> > >>>>>> 0.00505187 >>>>> > >>>>>> 0.0275887 >>>>> > >>>>>> 18 32 19403 19371 4.20309 4.02734 >>>>> > >>>>>> 0.00545154 >>>>> > >>>>>> 0.029348 >>>>> > >>>>>> 19 31 19845 19814 4.07295 1.73047 >>>>> > >>>>>> 0.00254726 >>>>> > >>>>>> 0.0306775 >>>>> > >>>>>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 >>>>> > >>>>>> avg >>>>> > >>>>>> lat: 0.0307559 >>>>> > >>>>>> sec Cur ops started finished avg MB/s cur MB/s 
last >>>>> > >>>>>> lat(s) >>>>> > >>>>>> avg lat(s) >>>>> > >>>>>> 20 31 20401 20370 3.97788 2.17188 >>>>> > >>>>>> 0.00307238 >>>>> > >>>>>> 0.0307559 >>>>> > >>>>>> 21 32 21338 21306 3.96254 3.65625 >>>>> > >>>>>> 0.00464563 >>>>> > >>>>>> 0.0312288 >>>>> > >>>>>> 22 32 23057 23025 4.0876 6.71484 >>>>> > >>>>>> 0.00296295 >>>>> > >>>>>> 0.0299267 >>>>> > >>>>>> 23 32 23057 23025 3.90988 0 >>>>> > >>>>>> - >>>>> > >>>>>> 0.0299267 >>>>> > >>>>>> 24 32 23803 23771 3.86837 1.45703 >>>>> > >>>>>> 0.00301471 >>>>> > >>>>>> 0.0312804 >>>>> > >>>>>> 25 32 24112 24080 3.76191 1.20703 >>>>> > >>>>>> 0.00191063 >>>>> > >>>>>> 0.0331462 >>>>> > >>>>>> 26 31 25303 25272 3.79629 4.65625 >>>>> > >>>>>> 0.00794399 >>>>> > >>>>>> 0.0329129 >>>>> > >>>>>> 27 32 28803 28771 4.16183 13.668 >>>>> > >>>>>> 0.0109817 >>>>> > >>>>>> 0.0297469 >>>>> > >>>>>> 28 32 29592 29560 4.12325 3.08203 >>>>> > >>>>>> 0.00188185 >>>>> > >>>>>> 0.0301911 >>>>> > >>>>>> 29 32 30595 30563 4.11616 3.91797 >>>>> > >>>>>> 0.00379099 >>>>> > >>>>>> 0.0296794 >>>>> > >>>>>> 30 32 31031 30999 4.03572 1.70312 >>>>> > >>>>>> 0.00283347 >>>>> > >>>>>> 0.0302411 >>>>> > >>>>>> Total time run: 30.822350 >>>>> > >>>>>> Total writes made: 31032 >>>>> > >>>>>> Write size: 4096 >>>>> > >>>>>> Object size: 4096 >>>>> > >>>>>> Bandwidth (MB/sec): 3.93282 >>>>> > >>>>>> Stddev Bandwidth: 3.66265 >>>>> > >>>>>> Max bandwidth (MB/sec): 13.668 >>>>> > >>>>>> Min bandwidth (MB/sec): 0 >>>>> > >>>>>> Average IOPS: 1006 >>>>> > >>>>>> Stddev IOPS: 937 >>>>> > >>>>>> Max IOPS: 3499 >>>>> > >>>>>> Min IOPS: 0 >>>>> > >>>>>> Average Latency(s): 0.0317779 >>>>> > >>>>>> Stddev Latency(s): 0.164076 >>>>> > >>>>>> Max latency(s): 2.27707 >>>>> > >>>>>> Min latency(s): 0.0013848 >>>>> > >>>>>> Cleaning up (deleting benchmark objects) >>>>> > >>>>>> Clean up completed and total clean up time :20.166559 >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar >>>>> > >>>>>> <mmokhtar@xxxxxxxxxxx> >>>>> > >>>>>> wrote: >>>>> > >>>>>> >>>>> > >>>>>>> First a general comment: local RAID will be faster than Ceph >>>>> > >>>>>>> for a >>>>> > >>>>>>> single threaded (queue depth=1) io operation test. A single >>>>> > >>>>>>> thread Ceph >>>>> > >>>>>>> client will see at best same disk speed for reads and for >>>>> > >>>>>>> writes 4-6 times >>>>> > >>>>>>> slower than single disk. Not to mention the latency of local >>>>> > >>>>>>> disks will >>>>> > >>>>>>> much better. Where Ceph shines is when you have many >>>>> > >>>>>>> concurrent ios, it >>>>> > >>>>>>> scales whereas RAID will decrease speed per client as you add >>>>> > >>>>>>> more. 
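(To put numbers on that using results already in this thread: the -t 1 rados bench above averaged about 0.0041s per 4KB write, and at queue depth 1 IOPS is simply 1/latency, so 1/0.0041 ≈ 240 IOPS ≈ 1 MB/s at 4KB - essentially what that run reported. The single-thread figure is therefore bound by the per-write round trip (network + replication + journal sync), not by raw drive throughput.)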
>>>>> > >>>>>>> >>>>> > >>>>>>> Having said that, i would recommend running rados/rbd >>>>> > >>>>>>> bench-write >>>>> > >>>>>>> and measure 4k iops at 1 and 32 threads to get a better idea >>>>> > >>>>>>> of how your >>>>> > >>>>>>> cluster performs: >>>>> > >>>>>>> >>>>> > >>>>>>> ceph osd pool create testpool 256 256 >>>>> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 1 >>>>> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 32 >>>>> > >>>>>>> ceph osd pool delete testpool testpool >>>>> > >>>>>>> --yes-i-really-really-mean-it >>>>> > >>>>>>> >>>>> > >>>>>>> rbd bench-write test-image --io-threads=1 --io-size 4096 >>>>> > >>>>>>> --io-pattern rand --rbd_cache=false >>>>> > >>>>>>> rbd bench-write test-image --io-threads=32 --io-size 4096 >>>>> > >>>>>>> --io-pattern rand --rbd_cache=false >>>>> > >>>>>>> >>>>> > >>>>>>> I think the request size difference you see is due to the io >>>>> > >>>>>>> scheduler in the case of local disks having more ios to >>>>> > >>>>>>> re-group so has a >>>>> > >>>>>>> better chance in generating larger requests. Depending on >>>>> > >>>>>>> your kernel, the >>>>> > >>>>>>> io scheduler may be different for rbd (blq-mq) vs sdx (cfq) >>>>> > >>>>>>> but again i >>>>> > >>>>>>> would think the request size is a result not a cause. >>>>> > >>>>>>> >>>>> > >>>>>>> Maged >>>>> > >>>>>>> >>>>> > >>>>>>> On 2017-10-17 23:12, Russell Glaue wrote: >>>>> > >>>>>>> >>>>> > >>>>>>> I am running ceph jewel on 5 nodes with SSD OSDs. >>>>> > >>>>>>> I have an LVM image on a local RAID of spinning disks. >>>>> > >>>>>>> I have an RBD image on in a pool of SSD disks. >>>>> > >>>>>>> Both disks are used to run an almost identical CentOS 7 >>>>> > >>>>>>> system. >>>>> > >>>>>>> Both systems were installed with the same kickstart, though >>>>> > >>>>>>> the disk >>>>> > >>>>>>> partitioning is different. >>>>> > >>>>>>> >>>>> > >>>>>>> I want to make writes on the the ceph image faster. For >>>>> > >>>>>>> example, >>>>> > >>>>>>> lots of writes to MySQL (via MySQL replication) on a ceph SSD >>>>> > >>>>>>> image are >>>>> > >>>>>>> about 10x slower than on a spindle RAID disk image. The MySQL >>>>> > >>>>>>> server on >>>>> > >>>>>>> ceph rbd image has a hard time keeping up in replication. >>>>> > >>>>>>> >>>>> > >>>>>>> So I wanted to test writes on these two systems >>>>> > >>>>>>> I have a 10GB compressed (gzip) file on both servers. >>>>> > >>>>>>> I simply gunzip the file on both systems, while running >>>>> > >>>>>>> iostat. >>>>> > >>>>>>> >>>>> > >>>>>>> The primary difference I see in the results is the average >>>>> > >>>>>>> size of >>>>> > >>>>>>> the request to the disk. >>>>> > >>>>>>> CentOS7-lvm-raid-sata writes a lot faster to disk, and the >>>>> > >>>>>>> size of >>>>> > >>>>>>> the request is about 40x, but the number of writes per second >>>>> > >>>>>>> is about the >>>>> > >>>>>>> same >>>>> > >>>>>>> This makes me want to conclude that the smaller size of the >>>>> > >>>>>>> request >>>>> > >>>>>>> for CentOS7-ceph-rbd-ssd system is the cause of it being >>>>> > >>>>>>> slow. >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> How can I make the size of the request larger for ceph rbd >>>>> > >>>>>>> images, >>>>> > >>>>>>> so I can increase the write throughput? >>>>> > >>>>>>> Would this be related to having jumbo packets enabled in my >>>>> > >>>>>>> ceph >>>>> > >>>>>>> storage network? 
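On the jumbo frame question specifically: enabling them is unlikely to change the request size the guest issues (as noted above, the request size looks like a result rather than a cause), but it is easy to check what the storage network is actually doing end to end. A quick sketch (interface name and peer address are placeholders):

ip link show eth2                               # shows the MTU currently set on the storage/cluster interface
ping -M do -s 8972 <storage-ip-of-another-node>  # 8972 bytes payload + 28 bytes headers = 9000; fails unless jumbo frames work end to end, switch included

If the MTU turns out to be 1500 end to end, jumbo frames were never in play for the numbers below.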
>>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> Here is a sample of the results: >>>>> > >>>>>>> >>>>> > >>>>>>> [CentOS7-lvm-raid-sata] >>>>> > >>>>>>> $ gunzip large10gFile.gz & >>>>> > >>>>>>> $ iostat -x vg_root-lv_var -d 5 -m -N >>>>> > >>>>>>> Device: rrqm/s wrqm/s r/s w/s rMB/s >>>>> > >>>>>>> wMB/s >>>>> > >>>>>>> avgrq-sz avgqu-sz await r_await w_await svctm %util >>>>> > >>>>>>> ... >>>>> > >>>>>>> vg_root-lv_var 0.00 0.00 30.60 452.20 13.60 >>>>> > >>>>>>> 222.15 >>>>> > >>>>>>> 1000.04 8.69 14.05 0.99 14.93 2.07 100.04 >>>>> > >>>>>>> vg_root-lv_var 0.00 0.00 88.20 182.00 39.20 >>>>> > >>>>>>> 89.43 >>>>> > >>>>>>> 974.95 4.65 9.82 0.99 14.10 3.70 100.00 >>>>> > >>>>>>> vg_root-lv_var 0.00 0.00 75.45 278.24 33.53 >>>>> > >>>>>>> 136.70 >>>>> > >>>>>>> 985.73 4.36 33.26 1.34 41.91 0.59 20.84 >>>>> > >>>>>>> vg_root-lv_var 0.00 0.00 111.60 181.80 49.60 >>>>> > >>>>>>> 89.34 >>>>> > >>>>>>> 969.84 2.60 8.87 0.81 13.81 0.13 3.90 >>>>> > >>>>>>> vg_root-lv_var 0.00 0.00 68.40 109.60 30.40 >>>>> > >>>>>>> 53.63 >>>>> > >>>>>>> 966.87 1.51 8.46 0.84 13.22 0.80 14.16 >>>>> > >>>>>>> ... >>>>> > >>>>>>> >>>>> > >>>>>>> [CentOS7-ceph-rbd-ssd] >>>>> > >>>>>>> $ gunzip large10gFile.gz & >>>>> > >>>>>>> $ iostat -x vg_root-lv_data -d 5 -m -N >>>>> > >>>>>>> Device: rrqm/s wrqm/s r/s w/s rMB/s >>>>> > >>>>>>> wMB/s >>>>> > >>>>>>> avgrq-sz avgqu-sz await r_await w_await svctm %util >>>>> > >>>>>>> ... >>>>> > >>>>>>> vg_root-lv_data 0.00 0.00 46.40 167.80 0.88 >>>>> > >>>>>>> 1.46 >>>>> > >>>>>>> 22.36 1.23 5.66 2.47 6.54 4.52 96.82 >>>>> > >>>>>>> vg_root-lv_data 0.00 0.00 16.60 55.20 0.36 >>>>> > >>>>>>> 0.14 >>>>> > >>>>>>> 14.44 0.99 13.91 9.12 15.36 13.71 98.46 >>>>> > >>>>>>> vg_root-lv_data 0.00 0.00 69.00 173.80 1.34 >>>>> > >>>>>>> 1.32 >>>>> > >>>>>>> 22.48 1.25 5.19 3.77 5.75 3.94 95.68 >>>>> > >>>>>>> vg_root-lv_data 0.00 0.00 74.40 293.40 1.37 >>>>> > >>>>>>> 1.47 >>>>> > >>>>>>> 15.83 1.22 3.31 2.06 3.63 2.54 93.26 >>>>> > >>>>>>> vg_root-lv_data 0.00 0.00 90.80 359.00 1.96 >>>>> > >>>>>>> 3.41 >>>>> > >>>>>>> 24.45 1.63 3.63 1.94 4.05 2.10 94.38 >>>>> > >>>>>>> ... >>>>> > >>>>>>> >>>>> > >>>>>>> [iostat key] >>>>> > >>>>>>> w/s == The number (after merges) of write requests completed >>>>> > >>>>>>> per >>>>> > >>>>>>> second for the device. >>>>> > >>>>>>> wMB/s == The number of sectors (kilobytes, megabytes) written >>>>> > >>>>>>> to the >>>>> > >>>>>>> device per second. >>>>> > >>>>>>> avgrq-sz == The average size (in kilobytes) of the requests >>>>> > >>>>>>> that >>>>> > >>>>>>> were issued to the device. >>>>> > >>>>>>> avgqu-sz == The average queue length of the requests that >>>>> > >>>>>>> were >>>>> > >>>>>>> issued to the device. 
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com