Have you ruled out the disk controller and backplane in the server running slower?

On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rglaue@xxxxxxxx> wrote:

I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as suggested.

Out of the 4 servers:
3 of them performed with 17% to 30% disk %busy, and 11% CPU wait, momentarily spiking up to 50% on one server and 80% on another.
The 2nd newest server was averaging almost 90% disk %busy and 150% CPU wait, and more than momentarily spiking to 101% disk busy and 250% CPU wait.
For this 2nd newest server, these were the statistics for about 8 of 9 disks, with the 9th disk not far behind the others.

I cannot believe all 9 disks are bad.
They are the same disks as in the newest 1st server, Crucial_CT960M500SSD1, and the exact same server hardware too.
They were purchased at the same time in the same purchase order and arrived at the same time.
So I cannot believe I just happened to put 9 bad disks in one server, and 9 good ones in the other.

I know I have Ceph configured exactly the same on all servers.
And I am sure I have the hardware settings configured exactly the same on the 1st and 2nd servers.
So if I were someone else, I would say it may be bad hardware on the 2nd server.
But the 2nd server is running very well without any hint of a problem.

Any other ideas or suggestions?

-RG

On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

Just run the same 32 threaded rados test as you did before, and this time run atop while the test is running, looking for %busy of cpu/disks. It should give an idea if there is a bottleneck in them.
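
The per-server comparison reported above (three hosts at 17% to 30% disk %busy, one near 90%) can be sketched as a simple outlier check. This is only an illustration: the host names are hypothetical, the sample values are picked from the reported ranges rather than measured, and the "more than twice the median" rule is an assumption, not something atop or Ceph provides.

```python
from statistics import median


def flag_outlier_hosts(busy_by_host, factor=2.0):
    """Return hosts whose average disk %busy exceeds factor x the median of all hosts."""
    mid = median(busy_by_host.values())
    return sorted(host for host, busy in busy_by_host.items() if busy > factor * mid)


# Illustrative values only, chosen from the ranges quoted in the message above.
sample = {"server1": 17.0, "server2": 90.0, "server3": 30.0, "server4": 25.0}
print(flag_outlier_hosts(sample))
```

A host that stands out this way while sharing disk model and configuration with its peers points at something host-specific, which is why the controller and backplane question is worth answering.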
On 2017-10-18 21:35, Russell Glaue wrote:
I cannot run the write test reviewed at the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The tests write directly to the raw disk device. Reading an infile (created with urandom) on one SSD and writing the outfile to another osd yields about 17MB/s.
But isn't this write speed limited by the speed at which the dd infile can be read?
And I assume the best test should be run with no other load.

How does one run the rados bench "as stress"?

-RG
On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Measuring resource load as outlined earlier will show if the drives are performing well or not. Also, how many osds do you have?
On 2017-10-18 19:26, Russell Glaue wrote:
The SSD drives are Crucial M500.
A Ceph user did some benchmarks and found it had good performance.
However, a user comment from 3 years ago on the blog post you linked to says to avoid the Crucial M500.
Yet, this performance posting tells that the Crucial M500 is good.
On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Check out the following link: some SSDs perform badly in Ceph due to sync writes to journal.
Another thing that can help is to re-run the rados 32 threads as stress and view resource usage using atop (or collectl/sar) to check for %busy cpu and %busy disks, to give you an idea of what is holding down your cluster. For example: if cpu/disk % are all low then check your network/switches. If disk %busy is high (90%) for all disks then your disks are the bottleneck, which either means you have SSDs that are not suitable for Ceph or you have too few disks (which i doubt is the case). If only 1 disk %busy is high, there may be something wrong with this disk and it should be removed.
Maged
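
The triage rules in the message above can be written down as a small decision function. This is a rough sketch, not a tool from atop or Ceph: the function name is made up, the 90% "high" threshold comes from the message, and the 50% "low" threshold is an assumption chosen only for illustration.

```python
def classify_bottleneck(cpu_busy, disk_busy, high=90.0, low=50.0):
    """Map %busy readings taken during a rados bench run to a likely bottleneck.

    cpu_busy  -- CPU %busy for the host
    disk_busy -- dict of disk name -> disk %busy
    Returns (verdict, list of disks at or above the high threshold).
    """
    hot = sorted(name for name, pct in disk_busy.items() if pct >= high)
    if hot and len(hot) == len(disk_busy):
        # Every disk is saturated: unsuitable SSDs or too few disks.
        return "disks", hot
    if hot:
        # Only some disks are saturated: those disks are suspect.
        return "suspect-disk", hot
    if cpu_busy < low and all(pct < low for pct in disk_busy.values()):
        # Nothing on the host is busy: look at the network and switches.
        return "network", []
    return "inconclusive", []


print(classify_bottleneck(10.0, {"sda": 95.0, "sdb": 20.0}))
```

The readings still have to be collected by hand from atop, collectl or sar on each storage server while the benchmark is running.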
On 2017-10-18 18:13, Russell Glaue wrote:
In my previous post, in one of my points I was wondering if the request size would increase if I enabled jumbo packets. Currently it is disabled.

@jdillama: The qemu settings for both these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.

@mmokhtar: Ok. So you suggest the request size is the result of the problem and not the cause of the problem, meaning I should go after a different issue.

I have been trying to get write speeds up to what people on this mail list are discussing.
It seems that for our configuration, as it matches others, we should be getting about 70MB/s write speed.
But we are not getting that.
Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 1MB/s to 2MB/s.
Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen very rare momentary spikes up to 30MB/s.

My storage network is connected via a 10Gb switch
I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller
Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID)
Each drive is one LVM group, with two volumes - one volume for the osd, one volume for the journal
Each osd is formatted with xfs
The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight
The redundancy is triple replication

While I have read comments that having the osd and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration, however this is also the reason why SSD drives are used exclusively for OSDs in the storage nodes.
None-the-less, I was still expecting write speeds to be above 30MB/s, not below 6MB/s.
Even at 12x slower than the RAID, using my previously posted iostat data set, I should be seeing write speeds that average 10MB/s, not 2MB/s.

In regards to the rados benchmark tests you asked me to run, here is the output:

[centos7]# rados bench -p scbench -b 4096 30 write -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1       1       201       200   0.78356   0.78125   0.00522307  0.00496574
    2       1       469       468  0.915303   1.04688   0.00437497  0.00426141
    3       1       741       740  0.964371    1.0625   0.00512853   0.0040434
    4       1       888       887  0.866739  0.574219   0.00307699  0.00450177
    5       1      1147      1146  0.895725   1.01172   0.00376454   0.0043559
    6       1      1325      1324  0.862293  0.695312   0.00459443    0.004525
    7       1      1494      1493   0.83339  0.660156   0.00461002  0.00458452
    8       1      1736      1735  0.847369  0.945312   0.00253971  0.00460458
    9       1      1998      1997  0.866922   1.02344   0.00236573  0.00450172
   10       1      2260      2259  0.882563   1.02344   0.00262179  0.00442152
   11       1      2526      2525  0.896775   1.03906   0.00336914  0.00435092
   12       1      2760      2759  0.898203  0.914062   0.00351827  0.00434491
   13       1      3016      3015  0.906025         1   0.00335703  0.00430691
   14       1      3257      3256  0.908545  0.941406   0.00332344  0.00429495
   15       1      3490      3489  0.908644  0.910156   0.00318815  0.00426387
   16       1      3728      3727  0.909952  0.929688    0.0032881  0.00428895
   17       1      3986      3985  0.915703   1.00781   0.00274809   0.0042614
   18       1      4250      4249  0.922116   1.03125   0.00287411  0.00423214
   19       1      4505      4504  0.926003  0.996094   0.00375435  0.00421442
2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20       1      4757      4756  0.928915  0.984375   0.00463972  0.00420118
   21       1      5009      5008   0.93155  0.984375   0.00360065  0.00418937
   22       1      5235      5234  0.929329  0.882812   0.00626214    0.004199
   23       1      5500      5499  0.933925   1.03516   0.00466584  0.00417836
   24       1      5708      5707  0.928861    0.8125   0.00285727  0.00420146
   25       0      5964      5964  0.931858   1.00391   0.00417383   0.0041881
   26       1      6216      6215  0.933722  0.980469    0.0041009  0.00417915
   27       1      6481      6480  0.937474   1.03516   0.00307484  0.00416118
   28       1      6745      6744  0.940819   1.03125   0.00266329  0.00414777
   29       1      7003      7002  0.943124   1.00781   0.00305905  0.00413758
   30       1      7271      7270  0.946578   1.04688   0.00391017  0.00412238
Total time run:         30.006060
Total writes made:      7272
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.946684
Stddev Bandwidth:       0.123762
Max bandwidth (MB/sec): 1.0625
Min bandwidth (MB/sec): 0.574219
Average IOPS:           242
Stddev IOPS:            31
Max IOPS:               272
Min IOPS:               147
Average Latency(s):     0.00412247
Stddev Latency(s):      0.00648437
Max latency(s):         0.270553
Min latency(s):         0.00175318
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :29.069423

[centos7]# rados bench -p scbench -b 4096 30 write -t 32
Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      32      3013      2981   11.6438   11.6445   0.00247906  0.00572026
    2      32      5349      5317   10.3834     9.125   0.00246662  0.00932016
    3      32      5707      5675    7.3883   1.39844   0.00389774   0.0156726
    4      32      5895      5863   5.72481  0.734375      1.13137   0.0167946
    5      32      6869      6837   5.34068   3.80469    0.0027652   0.0226577
    6      32      8901      8869   5.77306    7.9375    0.0053211   0.0216259
    7      32     10800     10768   6.00785   7.41797   0.00358187   0.0207418
    8      32     11825     11793   5.75728   4.00391   0.00217575   0.0215494
    9      32     12941     12909    5.6019   4.35938   0.00278512   0.0220567
   10      32     13317     13285   5.18849   1.46875    0.0034973   0.0240665
   11      32     16189     16157   5.73653   11.2188   0.00255841   0.0212708
   12      32     16749     16717   5.44077    2.1875   0.00330334   0.0215915
   13      32     16756     16724   5.02436 0.0273438   0.00338994    0.021849
   14      32     17908     17876   4.98686       4.5   0.00402598   0.0244568
   15      32     17936     17904   4.66171  0.109375   0.00375799   0.0245545
   16      32     18279     18247   4.45409   1.33984   0.00483873   0.0267929
   17      32     18372     18340   4.21346  0.363281   0.00505187   0.0275887
   18      32     19403     19371   4.20309   4.02734   0.00545154    0.029348
   19      31     19845     19814   4.07295   1.73047   0.00254726   0.0306775
2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20      31     20401     20370   3.97788   2.17188   0.00307238   0.0307559
   21      32     21338     21306   3.96254   3.65625   0.00464563   0.0312288
   22      32     23057     23025    4.0876   6.71484   0.00296295   0.0299267
   23      32     23057     23025   3.90988         0            -   0.0299267
   24      32     23803     23771   3.86837   1.45703   0.00301471   0.0312804
   25      32     24112     24080   3.76191   1.20703   0.00191063   0.0331462
   26      31     25303     25272   3.79629   4.65625   0.00794399   0.0329129
   27      32     28803     28771   4.16183    13.668    0.0109817   0.0297469
   28      32     29592     29560   4.12325   3.08203   0.00188185   0.0301911
   29      32     30595     30563   4.11616   3.91797   0.00379099   0.0296794
   30      32     31031     30999   4.03572   1.70312   0.00283347   0.0302411
Total time run:         30.822350
Total writes made:      31032
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     3.93282
Stddev Bandwidth:       3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS:           1006
Stddev IOPS:            937
Max IOPS:               3499
Min IOPS:               0
Average Latency(s):     0.0317779
Stddev Latency(s):      0.164076
Max latency(s):         2.27707
Min latency(s):         0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :20.166559
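
The two summaries above can be cross-checked with a little arithmetic. The sketch below assumes rados bench counts a MB as 2^20 bytes (which matches the per-second rows, e.g. 200 writes of 4096 bytes shown as 0.78125 MB/s) and applies Little's law, throughput = concurrency / average latency. The function names are made up for this example.

```python
def bandwidth_mb_s(iops, block_bytes):
    """Bandwidth in MB/s (1 MB = 2**20 bytes) implied by an IOPS figure."""
    return iops * block_bytes / (1024 * 1024)


def expected_iops(queue_depth, avg_latency_s):
    """Little's law: sustained ops/s = requests in flight / average latency."""
    return queue_depth / avg_latency_s


# Figures taken from the two rados bench summaries above.
for threads, iops, latency in [(1, 242, 0.00412247), (32, 1006, 0.0317779)]:
    print(threads, "threads:",
          round(bandwidth_mb_s(iops, 4096), 3), "MB/s implied by the IOPS,",
          round(expected_iops(threads, latency)), "IOPS implied by the latency")
```

Both runs come out consistent with the reported bandwidth (0.946684 and 3.93282 MB/s) and the reported average IOPS. Going from 1 to 32 threads raised the average latency from about 4 ms to about 32 ms, so throughput grew roughly 4x rather than 32x.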
On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
First a general comment: local RAID will be faster than Ceph for a single threaded (queue depth=1) io operation test. A single thread Ceph client will see at best the same disk speed for reads, and for writes 4-6 times slower than a single disk. Not to mention the latency of local disks will be much better. Where Ceph shines is when you have many concurrent ios: it scales, whereas RAID will decrease speed per client as you add more.
Having said that, i would recommend running rados/rbd bench-write and measure 4k iops at 1 and 32 threads to get a better idea of how your cluster performs:
ceph osd pool create testpool 256 256
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it

rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false

I think the request size difference you see is due to the io scheduler in the case of local disks having more ios to re-group, so it has a better chance of generating larger requests. Depending on your kernel, the io scheduler may be different for rbd (blk-mq) vs sdx (cfq), but again i would think the request size is a result not a cause.
Maged
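
To see which io scheduler each block device is actually using, the kernel exposes it under /sys/block/<device>/queue/scheduler, with the active one in square brackets. A minimal sketch of reading it follows; the device names in the usage note are hypothetical, and the bare "none" fallback reflects what some blk-mq kernels print, which may differ on yours.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) entry from a queue/scheduler line."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some kernels print a bare "none" with no brackets for blk-mq devices.
    return text.strip() or None


def scheduler_for(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler file for one block device."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


print(active_scheduler("noop deadline [cfq]"))
```

On a client, something like scheduler_for("rbd0") and scheduler_for("sda") would show whether the two devices are scheduled differently.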
On 2017-10-17 23:12, Russell Glaue wrote:
I am running ceph jewel on 5 nodes with SSD OSDs.
I have an LVM image on a local RAID of spinning disks.
I have an RBD image in a pool of SSD disks.
Both disks are used to run an almost identical CentOS 7 system.
Both systems were installed with the same kickstart, though the disk partitioning is different.

I want to make writes on the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on the ceph rbd image has a hard time keeping up in replication.

So I wanted to test writes on these two systems.
I have a 10GB compressed (gzip) file on both servers.
I simply gunzip the file on both systems, while running iostat.

The primary difference I see in the results is the average size of the request to the disk.
CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the request is about 40x larger, but the number of writes per second is about the same.
This makes me want to conclude that the smaller size of the request for the CentOS7-ceph-rbd-ssd system is the cause of it being slow.

How can I make the size of the request larger for ceph rbd images, so I can increase the write throughput?
Would this be related to having jumbo packets enabled in my ceph storage network?

Here is a sample of the results:

[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_var     0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
vg_root-lv_var     0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
vg_root-lv_var     0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
vg_root-lv_var     0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
vg_root-lv_var     0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
...

[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_data    0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
vg_root-lv_data    0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
vg_root-lv_data    0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
vg_root-lv_data    0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
vg_root-lv_data    0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
...

[iostat key]
w/s == The number (after merges) of write requests completed per second for the device.
wMB/s == The number of megabytes written to the device per second.
avgrq-sz == The average size (in sectors) of the requests that were issued to the device.
avgqu-sz == The average queue length of the requests that were issued to the device.
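
The avgrq-sz column can be recomputed from the other columns of the same row, which also shows the unit it is reported in. The sketch below assumes 512-byte sectors and 1 MB = 2^20 bytes; with those assumptions the first row of each sample above reproduces the avgrq-sz that iostat printed. The function name is made up for this example.

```python
SECTOR_BYTES = 512  # assumption: iostat counts avgrq-sz in 512-byte sectors


def avg_request_sectors(r_s, w_s, r_mb_s, w_mb_s):
    """Average request size in sectors, from the r/s, w/s, rMB/s and wMB/s columns."""
    total_requests = r_s + w_s
    if total_requests == 0:
        return 0.0
    total_bytes = (r_mb_s + w_mb_s) * 1024 * 1024
    return total_bytes / SECTOR_BYTES / total_requests


# First data row of each iostat sample above.
raid = avg_request_sectors(30.60, 452.20, 13.60, 222.15)
ceph = avg_request_sectors(46.40, 167.80, 0.88, 1.46)
print(round(raid, 2), round(ceph, 2), round(raid / ceph, 1))
```

In other words the RAID volume is seeing requests of roughly 500 KB while the rbd-backed volume is seeing requests of roughly 11 KB, at a similar number of requests per second.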
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com