It is quite likely related; things are pointing to bad disks. Probably the best thing is to plan for disk replacement, the sooner the better, as it could get worse.
On 2017-10-27 02:22, Christian Wuerdig wrote:
Hm, not necessarily directly related to your performance problem, however: these SSDs have a listed endurance of 72TB total data written - over a 5 year period that's 40GB a day, or approx 0.04 DWPD. Given that you run the journal for each OSD on the same disk, that's effectively at most 0.02 DWPD (about 20GB per day per disk). I don't know many who'd run a cluster on disks like those. It also means these are pure consumer drives, which have a habit of exhibiting random performance at times (based on unquantified anecdotal personal experience with other consumer model SSDs). I wouldn't touch these with a long stick for anything but small toy-test clusters. On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rglaue@xxxxxxxx> wrote:
On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
It depends on what stage you are in: in production, probably the best thing is to set up a monitoring tool (collectd/graphite/prometheus/grafana) to monitor both Ceph stats and resource load. This will, among other things, show you if you have slowing disks.
I am monitoring Ceph performance with ceph-dash (http://cephdash.crapworks.de/), which is why I knew to look into the slow-writes issue. And I am using Monitorix (http://www.monitorix.org/) to monitor system resources, including disk I/O.
However, though I can monitor individual disk performance at the system level, it seems Ceph does not tax any disk more than the worst disk. So in my monitoring charts, all disks have the same performance. All four nodes baseline at 50 writes/sec during the cluster's normal load, with the non-problem hosts spiking up to 150 and the problem host only spiking up to 100. But during the window of time when I took the problem host's OSDs down to run the bench tests, the OSDs on the other nodes increased to 300-500 writes/sec. Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
Before production you should first make sure your SSDs are suitable for Ceph, either because they are recommended by other Ceph users or because you have tested them yourself for sync write performance using the fio tool, as outlined earlier. Then, after you build your cluster, you can use rados and/or rbd benchmark tests to benchmark your cluster and find bottlenecks using atop/sar/collectl, which will help you tune your cluster.
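For reference, the sync write test from that blog is a one-liner with fio; a sketch, assuming /dev/sdX is a disk you can safely destroy (it writes directly to the raw device):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test
  # good journal SSDs sustain thousands of sync write iops here;
  # consumer drives reportedly collapse to a few hundred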
All 36 OSDs are: Crucial_CT960M500SSD1
Rados bench tests were done at the beginning. The speed was much faster than it is now. I cannot recall the test results; someone else on my team ran them. Recently, I had thought the slow disk problem was a configuration issue with Ceph - before I posted here. Now we are hoping it may be resolved with a firmware update. (If it is firmware related, rebooting the problem node may temporarily resolve this.)
Though you did see better improvements, your cluster with 27 SSDs should give much higher numbers than 3k iops. If you are running rados bench while you have other client io, then obviously the number reported by the tool will be less than what the cluster is actually giving... which you can find out via the ceph status command; it will print the total cluster throughput and iops. If the total is still low I would recommend running the fio raw disk test; maybe the disks are not suitable. When you removed your 9 bad disks from 36 and your performance doubled, you still had 2 other disks slowing you... meaning near 100% busy? It makes me feel the disk type used is not good. For these near-100%-busy disks, can you also measure their raw disk iops at that load (I am not sure atop shows this; if not, use sar/sysstat/iostat/collectl).
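For example, iostat from the sysstat package reports per-device write iops and busy percentage while the bench runs (the device names below are placeholders for your OSD data disks):

  iostat -x -d 5 /dev/sdb /dev/sdc
  # w/s   = raw write iops per device
  # %util = device busy; a disk pinned near 100% is your bottleneck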
I ran another bench test today with all 36 OSDs up. The overall performance was improved slightly compared to the original tests. Only 3 OSDs on the problem host were increasing to 101% disk busy. The iops reported by ceph status during this bench test ranged from 1.6k to 3.3k, with the test itself yielding 4k iops.
Yes, the two other OSDs/disks that were the bottleneck were at 101% disk busy. The other OSD disks on the same host were sailing along at like 50-60% busy.
All 36 OSD disks are exactly the same disk model. They were all purchased at the same time, and all were installed at the same time. I cannot believe it is a problem with the disk model. A failed/bad disk, perhaps, is possible. But the disk model itself cannot be the problem based on what I am seeing. If I am seeing bad performance on all disks on one ceph node/host, but not on another ceph node with these same disks, it has to be some other factor. This is why I am now guessing a firmware upgrade is needed.
Also, as I alluded to earlier: I took down all 9 OSDs in the problem host yesterday to run the bench test. Today, with those 9 OSDs back online, I reran the bench test, and I am seeing 2-3 OSD disks at 101% busy on the problem host while the other disks are below 80%. So, for whatever reason, shutting the OSDs down and starting them back up allowed many (not all) of the OSDs on the problem host to improve.
Maged
On 2017-10-25 23:44, Russell Glaue wrote:
Thanks to all. I took the OSDs down in the problem host, without shutting down the machine. As predicted, our MB/s about doubled. Using this bench/atop procedure, I found two other OSDs on another host that are the next bottlenecks.
Is this the only good way to really test the performance of the drives as OSDs? Is there any other way?
While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But the two new problem OSDs I just discovered in this recent test of 27 OSDs did not stick out at all, because ceph bench distributes the load, making only the very worst offenders show up in atop. So ceph is as slow as your slowest drive.
It would be really great if I could run the bench test and somehow get the bench to use only certain OSDs during the test. Then I could run the test avoiding the OSDs that I already know are a problem, so I can find the next worst OSD.
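One way to approximate this - only a sketch, and note that moving OSDs out of their host buckets triggers data movement in existing pools, so it is really only safe on a cluster you can afford to rebalance - is to build a temporary CRUSH root holding just the OSDs you want to test and bench a pool placed on it (all bucket/rule/pool names below are made up):

  ceph osd crush add-bucket benchroot root
  ceph osd crush add-bucket benchhost host
  ceph osd crush move benchhost root=benchroot
  # repeat for each OSD under test; 0.87 is an example weight for a ~960GB disk
  ceph osd crush create-or-move osd.3 0.87 host=benchhost
  ceph osd crush rule create-simple benchrule benchroot osd
  ceph osd pool create benchpool 128 128 replicated benchrule
  rados bench -p benchpool -b 4096 30 write -t 32
  # afterwards: delete benchpool and move the OSDs back under their real hosts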
[ the bench test ] rados bench -p scbench -b 4096 30 write -t 32
[ original results with all 36 OSDs ]
Total time run:         30.822350
Total writes made:      31032
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     3.93282
Stddev Bandwidth:       3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS:           1006
Stddev IOPS:            937
Max IOPS:               3499
Min IOPS:               0
Average Latency(s):     0.0317779
Stddev Latency(s):      0.164076
Max latency(s):         2.27707
Min latency(s):         0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time: 20.166559
[ after stopping all of the OSDs (9) on the problem host ]
Total time run:         32.586830
Total writes made:      59491
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     7.13131
Stddev Bandwidth:       9.78725
Max bandwidth (MB/sec): 29.168
Min bandwidth (MB/sec): 0
Average IOPS:           1825
Stddev IOPS:            2505
Max IOPS:               7467
Min IOPS:               0
Average Latency(s):     0.0173691
Stddev Latency(s):      0.21634
Max latency(s):         6.71283
Min latency(s):         0.00107473
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time: 16.269393
On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rglaue@xxxxxxxx> wrote:
On the machine in question, the 2nd newest, we are using the LSI MegaRAID SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery. The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported earlier, each single drive configured as RAID0.
Thanks for everyone's help. I am going to run a 32 thread bench test after taking the 2nd machine out of the cluster with noout. After it is out of the cluster, I am expecting the slow write issue will not surface.
On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
I can attest that the battery in the RAID controller is a thing. I'm used to using LSI controllers, but my current position has HP RAID controllers, and we just tracked down 10 of our nodes that pretty much always had >100ms await; they were the only 10 nodes in the cluster with failed batteries on the RAID controllers.
On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
That is a good idea. However, a previous rebalancing process brought performance of our Guest VMs to a slow drag.
Never mind that I'm not sure these SSDs are particularly well suited for Ceph; your problem is clearly located on that one node.
Not that I think it's the case, but make sure your PG distribution is not skewed with many more PGs per OSD on that node.
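A quick way to check is:

  ceph osd df tree
  # compare the PGS column across OSDs/hosts; a node carrying noticeably
  # more PGs per OSD will see proportionally more of the write load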
Once you rule that out, my first guess is the RAID controller. You're running the SSDs as single RAID0s, I presume? If so, either a configuration difference or a failed BBU on the controller could result in the writeback cache being disabled, which would explain things beautifully.
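On LSI controllers something like the following will show it (the binary may be MegaCli, MegaCli64, or storcli depending on the install):

  MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL   # BBU state and charge
  MegaCli64 -LDGetProp -Cache -LAll -aALL    # cache policy per logical drive
  # 'WriteThrough' on the slow node where the others show 'WriteBack'
  # would confirm the theory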
As for a temporary test/fix (with reduced redundancy of course), set noout (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host off.
This should result in much better performance than you have now and of course be the final confirmation of that host being the culprit.
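A sketch of the procedure, assuming systemd and example OSD ids:

  ceph osd set noout                # keep CRUSH from marking the down OSDs out
  systemctl stop ceph-osd@9         # repeat per OSD on the slow host, or power it off
  # ... run the rados bench test ...
  systemctl start ceph-osd@9
  ceph osd unset noout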
Christian
On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <jelopez@xxxxxxxxxx> wrote:
Hi Russell,
As you have 4 servers, and assuming you are not doing EC pools, just stop all the OSDs on the second, questionable server, mark those OSDs out, let the cluster rebalance, and when all PGs are active+clean just rerun the test.
All IOs should then go only to the other 3 servers.
JC
On Oct 19, 2017, at 13:49, Russell Glaue <rglaue@xxxxxxxx> wrote:
No, I have not ruled out the disk controller and backplane making the disks slower. Is there a way I could test that theory, other than swapping out hardware? -RG
On Thu, Oct 19, 2017 at 3:44 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
Have you ruled out the disk controller and backplane in the server running slower?
On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rglaue@xxxxxxxx> wrote:
I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as suggested.
Out of the 4 servers, 3 of them performed with 17% to 30% disk %busy and 11% CPU wait, momentarily spiking up to 50% on one server and 80% on another. The 2nd newest server was averaging almost 90% disk %busy and 150% CPU wait, and more than momentarily spiking to 101% disk busy and 250% CPU wait. For this 2nd newest server, these were the statistics for about 8 of 9 disks, with the 9th disk not far behind the others.
I cannot believe all 9 disks are bad. They are the same disks as in the newest 1st server, Crucial_CT960M500SSD1, with the same exact server hardware too. They were purchased at the same time in the same purchase order and arrived at the same time. So I cannot believe I just happened to put 9 bad disks in one server and 9 good ones in the other.
I know I have Ceph configured exactly the same on all servers, and I am sure I have the hardware settings configured exactly the same on the 1st and 2nd servers. So if I were someone else, I would say it maybe is bad hardware on the 2nd server. But the 2nd server is otherwise running very well, without any hint of a problem.
Any other ideas or suggestions?
-RG
On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Just run the same 32-threaded rados test as you did before, and this time run atop while the test is running, looking for %busy of cpu/disks. It should give an idea of whether there is a bottleneck in them.
On 2017-10-18 21:35, Russell Glaue wrote:
I cannot run the write test reviewed at the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The tests write directly to the raw disk device. Reading an infile (created with urandom) on one SSD and writing the outfile to another OSD yields about 17MB/s. But isn't this write speed limited by the speed at which the dd infile can be read? And I assume the best test should be run with no other load.
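(If I had a device I could safely overwrite, reading from /dev/zero with forced sync writes, as in that blog post, would avoid the read-side limit - something like:

  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
  # /dev/zero is generated far faster than any SSD can sync-write,
  # so the result reflects the target disk alone

- but that destroys the data on /dev/sdX.)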
How does one run the rados bench "as stress"?
-RG
On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Measuring resource load as outlined earlier will show whether the drives are performing well or not. Also, how many osds do you have?
On 2017-10-18 19:26, Russell Glaue wrote:
The SSD drives are Crucial M500. A Ceph user did some benchmarks and found it had good performance: https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
However, a user comment from 3 years ago on the blog post you linked to says to avoid the Crucial M500
Yet this performance posting says the Crucial M500 is good: https://inside.servers.com/ssd-performance-2017-c4307a92dea
On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
Check out the following link: some SSDs perform badly in Ceph due to sync writes to the journal.
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Another thing that can help is to re-run the rados 32-thread test as stress and view resource usage using atop (or collectl/sar) to check %busy cpu and %busy disks, to give you an idea of what is holding down your cluster. For example: if cpu/disk % are all low, then check your network/switches. If disk %busy is high (90%) for all disks, then your disks are the bottleneck: which either means you have SSDs that are not suitable for Ceph, or you have too few disks (which I doubt is the case). If only 1 disk's %busy is high, there may be something wrong with this disk and it should be removed.
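With sysstat installed, sar covers all three checks while the bench runs (the interval/count values are just examples):

  sar -u 2 5        # CPU: high %iowait with idle CPU points at the disks
  sar -d -p 2 5     # per-disk %util - the disk busy check above
  sar -n DEV 2 5    # per-NIC throughput, to rule out the network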
Maged
On 2017-10-18 18:13, Russell Glaue wrote:
In my previous post, in one of my points, I was wondering if the request size would increase if I enabled jumbo packets. Currently it is disabled.
@jdillama: The qemu settings for both these two guest machines, with RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD image object size>" will directly address the issue.
@mmokhtar: OK. So you suggest the request size is the result of the problem and not the cause of the problem, meaning I should go after a different issue.
I have been trying to get write speeds up to what people on this mail list are discussing. It seems that for our configuration, as it matches others, we should be getting about 70MB/s write speed. But we are not getting that. Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically 1MB/s to 2MB/s. Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I have seen very rare momentary spikes up to 30MB/s.
My storage network is connected via a 10Gb switch.
I have 4 storage servers with an LSI Logic MegaRAID SAS 2208 controller.
Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID).
Each drive is one LVM group with two volumes: one volume for the osd, one volume for the journal.
Each osd is formatted with xfs.
The crush map is simple: default->rack->[host[1..4]->osd] with an evenly distributed weight.
The redundancy is triple replication.
While I have read comments that having the osd and journal on the same disk decreases write speed, I have also read that once past 8 OSDs per node this is the recommended configuration; it is also the reason why SSD drives are used exclusively for OSDs in the storage nodes. Nonetheless, I was still expecting write speeds to be above 30MB/s, not below 6MB/s. Even at 12x slower than the RAID, using my previously posted iostat data set, I should be seeing write speeds that average 10MB/s, not 2MB/s.
In regards to the rados benchmark tests you asked me to run, here is the output:
[centos7]# rados bench -p scbench -b 4096 30 write -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1       1       201       200   0.78356   0.78125  0.00522307  0.00496574
    2       1       469       468  0.915303   1.04688  0.00437497  0.00426141
    3       1       741       740  0.964371    1.0625  0.00512853   0.0040434
    4       1       888       887  0.866739  0.574219  0.00307699  0.00450177
    5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559
    6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525
    7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452
    8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458
    9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172
   10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152
   11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092
   12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491
   13       1      3016      3015  0.906025         1  0.00335703  0.00430691
   14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495
   15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387
   16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895
   17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614
   18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214
   19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442
2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118
   21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937
   22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199
   23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836
   24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146
   25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881
   26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915
   27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118
   28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777
   29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758
   30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238
Total time run:         30.006060
Total writes made:      7272
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     0.946684
Stddev Bandwidth:       0.123762
Max bandwidth (MB/sec): 1.0625
Min bandwidth (MB/sec): 0.574219
Average IOPS:           242
Stddev IOPS:            31
Max IOPS:               272
Min IOPS:               147
Average Latency(s):     0.00412247
Stddev Latency(s):      0.00648437
Max latency(s):         0.270553
Min latency(s):         0.00175318
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time: 29.069423
[centos7]# rados bench -p scbench -b 4096 30 write -t 32
Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026
    2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016
    3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726
    4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946
    5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577
    6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259
    7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418
    8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494
    9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567
   10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665
   11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708
   12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915
   13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849
   14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568
   15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545
   16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929
   17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887
   18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348
   19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775
2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559
   21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288
   22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267
   23      32     23057     23025   3.90988         0           -   0.0299267
   24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804
   25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462
   26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129
   27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469
   28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911
   29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794
   30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411
Total time run:         30.822350
Total writes made:      31032
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     3.93282
Stddev Bandwidth:       3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS:           1006
Stddev IOPS:            937
Max IOPS:               3499
Min IOPS:               0
Average Latency(s):     0.0317779
Stddev Latency(s):      0.164076
Max latency(s):         2.27707
Min latency(s):         0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time: 20.166559
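A sanity check on these numbers (simple queue-depth arithmetic, not an exact model): at 1 thread, iops ≈ threads / average latency = 1 / 0.00412s ≈ 243, matching the reported 242; at 32 threads, 32 / 0.0318s ≈ 1007, matching the reported 1006. So throughput is latency-bound: 32x more threads bought only ~4x the iops because per-op latency grew almost 8x.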
On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
First, a general comment: local RAID will be faster than Ceph for a single-threaded (queue depth=1) io operation test. A single-threaded Ceph client will see at best the same disk speed for reads, and for writes 4-6 times slower than a single disk. Not to mention the latency of local disks will be much better. Where Ceph shines is when you have many concurrent ios: it scales, whereas RAID will decrease speed per client as you add more.
Having said that, i would recommend running rados/rbd bench-write and measure 4k iops at 1 and 32 threads to get a better idea of how your cluster performs:
ceph osd pool create testpool 256 256
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
I think the request size difference you see is due to the io scheduler in the case of local disks having more ios to re-group, so it has a better chance of generating larger requests. Depending on your kernel, the io scheduler may be different for rbd (blk-mq) vs sdX (cfq), but again I would think the request size is a result, not a cause.
Maged
On 2017-10-17 23:12, Russell Glaue wrote:
I am running ceph jewel on 5 nodes with SSD OSDs. I have an LVM image on a local RAID of spinning disks. I have an RBD image in a pool of SSD disks. Both disks are used to run an almost identical CentOS 7 system. Both systems were installed with the same kickstart, though the disk partitioning is different.
I want to make writes on the ceph image faster. For example, lots of writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x slower than on a spindle RAID disk image. The MySQL server on the ceph rbd image has a hard time keeping up in replication.
So I wanted to test writes on these two systems. I have a 10GB compressed (gzip) file on both servers. I simply gunzip the file on both systems while running iostat.
The primary difference I see in the results is the average size of the request to the disk. CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the request is about 40x larger, but the number of writes per second is about the same. This makes me want to conclude that the smaller request size on the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
How can I make the size of the request larger for ceph rbd images, so I can increase the write throughput? Would this be related to having jumbo packets enabled in my ceph storage network?
Here is a sample of the results:
[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device:        rrqm/s wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
...
vg_root-lv_var   0.00   0.00  30.60 452.20  13.60 222.15  1000.04     8.69  14.05    0.99   14.93  2.07 100.04
vg_root-lv_var   0.00   0.00  88.20 182.00  39.20  89.43   974.95     4.65   9.82    0.99   14.10  3.70 100.00
vg_root-lv_var   0.00   0.00  75.45 278.24  33.53 136.70   985.73     4.36  33.26    1.34   41.91  0.59  20.84
vg_root-lv_var   0.00   0.00 111.60 181.80  49.60  89.34   969.84     2.60   8.87    0.81   13.81  0.13   3.90
vg_root-lv_var   0.00   0.00  68.40 109.60  30.40  53.63   966.87     1.51   8.46    0.84   13.22  0.80  14.16
...
[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device:         rrqm/s wrqm/s   r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
...
vg_root-lv_data   0.00   0.00 46.40 167.80   0.88   1.46    22.36     1.23   5.66    2.47    6.54  4.52 96.82
vg_root-lv_data   0.00   0.00 16.60  55.20   0.36   0.14    14.44     0.99  13.91    9.12   15.36 13.71 98.46
vg_root-lv_data   0.00   0.00 69.00 173.80   1.34   1.32    22.48     1.25   5.19    3.77    5.75  3.94 95.68
vg_root-lv_data   0.00   0.00 74.40 293.40   1.37   1.47    15.83     1.22   3.31    2.06    3.63  2.54 93.26
vg_root-lv_data   0.00   0.00 90.80 359.00   1.96   3.41    24.45     1.63   3.63    1.94    4.05  2.10 94.38
...
[iostat key]
w/s      == The number (after merges) of write requests completed per second for the device.
wMB/s    == The number of sectors (kilobytes, megabytes) written to the device per second.
avgrq-sz == The average size (in kilobytes) of the requests that were issued to the device.
avgqu-sz == The average queue length of the requests that were issued to the device.
-- Christian Balzer    Network/Systems Engineer    chibi@xxxxxxx    Rakuten Communications