Re: How to increase the size of requests written to a ceph image

Russell Glaue <rglaue@xxxxxxxx> · Mon, 23 Oct 2017 08:28:01 -0500

The two newest machines have the LSI MegaRAID SAS-3 3008 [Fury]. The first one performs the best of the four. The second one is the problem host. The Non-RAID option just takes RAID configuration out of the picture so ceph can have direct access to the disk. We will need that in the future to use ceph's support for the SSD clipping function. The RAID-only controllers do not support SSD clipping, and they force us to configure a RAID for every disk, which we don't like.
If you know of problems with LSI MegaRAID, please elaborate.
Thanks.
-RG

On Fri, Oct 20, 2017 at 10:04 PM, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

On Fri, 20 Oct 2017 13:35:55 -0500 Russell Glaue wrote:

> On the machine in question, the 2nd newest, we are using the LSI MegaRAID

> SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery.

> The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported

> earlier, each single drive configured as RAID0.

>

There you go then, that's your explanation.

And also the reason that these SSDs perform so "well" in the RAID0

config despite my doubts about their suitability with Ceph.

If you were to put Intel DC S36xx, S37xx or Samsung SM 863 in the IT mode

host you'd likely get the speed you want if not better.

Christian

> Thanks for everyone's help.

> I am going to run a 32 thread bench test after taking the 2nd machine out

> of the cluster with noout.

> After it is out of the cluster, I am expecting the slow write issue will

> not surface.

>

>

> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonstein@xxxxxxxxx> wrote:

>

> > I can attest that the battery in the raid controller is a thing. I'm used
> > to using lsi controllers, but my current position has hp raid controllers
> > and we just tracked down 10 of our nodes that had >100ms await pretty much
> > always; they were the only 10 nodes in the cluster with failed batteries on
> > the raid controllers.

> >

> > On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <chibi@xxxxxxx> wrote:

> >

> >>

> >> Hello,

> >>

> >> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:

> >>

> >> > That is a good idea.

> >> > However, a previous rebalancing process has brought performance of our

> >> > Guest VMs to a slow drag.

> >> >

> >>

> >> Never mind that I'm not sure that these SSDs are particularly well suited

> >> for Ceph, your problem is clearly located on that one node.

> >>

> >> Not that I think it's the case, but make sure your PG distribution is not

> >> skewed with many more PGs per OSD on that node.

> >>

> >> Once you rule that out my first guess is the RAID controller, you're
> >> running the SSDs as single RAID0s I presume?
> >> If so either a configuration difference or a failed BBU on the controller

> >> could result in the writeback cache being disabled, which would explain

> >> things beautifully.

> >>

> >> As for a temporary test/fix (with reduced redundancy of course), set noout

> >> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host

> >> off.

> >>

> >> This should result in much better performance than you have now and of

> >> course be the final confirmation of that host being the culprit.

> >>

> >> Christian

> >>

> >> >

> >> > On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <jelopez@xxxxxxxxxx> wrote:

> >> >

> >> > > Hi Russell,

> >> > >

> >> > > as you have 4 servers, assuming you are not doing EC pools, just stop all
> >> > > the OSDs on the second questionable server, mark the OSDs on that server as
> >> > > out, let the cluster rebalance and when all PGs are active+clean just
> >> > > replay the test.

> >> > >

> >> > > All IOs should then go only to the other 3 servers.

> >> > >

> >> > > JC

> >> > >

> >> > > On Oct 19, 2017, at 13:49, Russell Glaue <rglaue@xxxxxxxx> wrote:

> >> > >

> >> > > No, I have not ruled out the disk controller and backplane making the

> >> > > disks slower.

> >> > > Is there a way I could test that theory, other than swapping out hardware?

> >> > > -RG

> >> > >

> >> > > On Thu, Oct 19, 2017 at 3:44 PM, David Turner <drakonstein@xxxxxxxxx> wrote:

> >> > >

> >> > >> Have you ruled out the disk controller and backplane in the server

> >> > >> running slower?

> >> > >>

> >> > >> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rglaue@xxxxxxxx> wrote:

> >> > >>

> >> > >>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
> >> > >>> as suggested.
> >> > >>>
> >> > >>> Out of the 4 servers:
> >> > >>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait,
> >> > >>> momentarily spiking up to 50% on one server, and 80% on another.
> >> > >>> The 2nd newest server was averaging almost 90% disk %busy and 150% CPU
> >> > >>> wait, and more than momentarily spiking to 101% disk busy and 250% CPU wait.
> >> > >>> For this 2nd newest server, these were the statistics for about 8 of 9
> >> > >>> disks, with the 9th disk not far behind the others.

> >> > >>>

> >> > >>> I cannot believe all 9 disks are bad
> >> > >>> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
> >> > >>> and same exact server hardware too.
> >> > >>> They were purchased at the same time in the same purchase order and
> >> > >>> arrived at the same time.
> >> > >>> So I cannot believe I just happened to put 9 bad disks in one server,
> >> > >>> and 9 good ones in the other.
> >> > >>>
> >> > >>> I know I have Ceph configured exactly the same on all servers
> >> > >>> And I am sure I have the hardware settings configured exactly the same
> >> > >>> on the 1st and 2nd servers.
> >> > >>> So if I were someone else, I would say it maybe is bad hardware on the
> >> > >>> 2nd server.
> >> > >>> But the 2nd server is running very well without any hint of a problem.

> >> > >>>

> >> > >>> Any other ideas or suggestions?

> >> > >>>

> >> > >>> -RG

> >> > >>>

> >> > >>>

> >> > >>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> >> > >>> wrote:

> >> > >>>

> >> > >>>> just run the same 32 threaded rados test as you did before and this
> >> > >>>> time run atop while the test is running looking for %busy of cpu/disks. It
> >> > >>>> should give an idea if there is a bottleneck in them.

> >> > >>>>

> >> > >>>> On 2017-10-18 21:35, Russell Glaue wrote:

> >> > >>>>

> >> > >>>> I cannot run the write test reviewed at the
> >> > >>>> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The tests
> >> > >>>> write directly to the raw disk device.
> >> > >>>> Reading an infile (created with urandom) on one SSD, writing the
> >> > >>>> outfile to another osd, yields about 17MB/s.
> >> > >>>> But isn't this write speed limited by the speed at which the dd
> >> > >>>> infile can be read?

> >> > >>>> And I assume the best test should be run with no other load.

> >> > >>>>

> >> > >>>> How does one run the rados bench "as stress"?

> >> > >>>>

> >> > >>>> -RG

> >> > >>>>

> >> > >>>>

> >> > >>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> >> > >>>> wrote:

> >> > >>>>

> >> > >>>>> measuring resource load as outlined earlier will show if the drives
> >> > >>>>> are performing well or not. Also how many osds do you have?

> >> > >>>>>

> >> > >>>>> On 2017-10-18 19:26, Russell Glaue wrote:

> >> > >>>>>

> >> > >>>>> The SSD drives are Crucial M500

> >> > >>>>> A Ceph user did some benchmarks and found it had good performance

> >> > >>>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
> >> > >>>>>
> >> > >>>>> However, a user comment from 3 years ago on the blog post you linked
> >> > >>>>> to says to avoid the Crucial M500

> >> > >>>>>

> >> > >>>>> Yet, this performance posting tells that the Crucial M500 is good.

> >> > >>>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea

> >> > >>>>>

> >> > >>>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> >> > >>>>> wrote:

> >> > >>>>>

> >> > >>>>>> Check out the following link: some SSDs perform badly in Ceph due to
> >> > >>>>>> sync writes to the journal
> >> > >>>>>>
> >> > >>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >> > >>>>>>
> >> > >>>>>> Another thing that can help is to re-run the rados 32 threads as
> >> > >>>>>> stress and view resource usage using atop (or collectl/sar) to check for
> >> > >>>>>> %busy cpu and %busy disks to give you an idea of what is holding down your
> >> > >>>>>> cluster. For example: if cpu/disk % are all low then check your
> >> > >>>>>> network/switches. If disk %busy is high (90%) for all disks then your
> >> > >>>>>> disks are the bottleneck, which either means you have SSDs that are not
> >> > >>>>>> suitable for Ceph or you have too few disks (which I doubt is the case). If
> >> > >>>>>> only 1 disk %busy is high, there may be something wrong with this disk and it
> >> > >>>>>> should be removed.
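
The rule of thumb above can be written down as a small decision function. This is only a sketch: the function name, the per-host grouping and the 50% "low" cut-off are assumptions made for illustration (only the 90% figure comes from the message), and the sample readings are made up, not measurements from this cluster.

```python
from typing import Dict, List


def diagnose(disk_busy: Dict[str, List[float]],
             cpu_busy: Dict[str, float],
             high: float = 90.0,
             low: float = 50.0) -> str:
    """Apply the %busy rule of thumb to atop-style readings.

    disk_busy maps host -> list of per-disk %busy values.
    cpu_busy maps host -> CPU %busy.
    `high` is the 90% figure from the thread; `low` is an assumed cut-off.
    """
    readings = [(host, busy) for host, disks in disk_busy.items() for busy in disks]
    hot = [(host, busy) for host, busy in readings if busy >= high]

    if not hot:
        everything_low = (all(busy < low for _, busy in readings)
                          and all(cpu < low for cpu in cpu_busy.values()))
        return "check network/switches" if everything_low else "inconclusive"
    if len(hot) == len(readings):
        return "disks are the bottleneck"
    if len(hot) == 1:
        return "suspect a single disk on " + hot[0][0]
    hot_hosts = sorted({host for host, _ in hot})
    if len(hot_hosts) == 1:
        # Extension beyond the original rule: many busy disks on one host
        # points at something those disks share (controller, backplane).
        return "suspect host " + hot_hosts[0]
    return "inconclusive"


# Made-up readings: three quiet hosts and one host with every disk saturated.
sample_disks = {
    "host1": [20.0, 25.0, 30.0],
    "host2": [92.0, 95.0, 91.0],
    "host3": [17.0, 22.0, 28.0],
    "host4": [19.0, 24.0, 27.0],
}
sample_cpu = {"host1": 11.0, "host2": 60.0, "host3": 11.0, "host4": 11.0}
print(diagnose(sample_disks, sample_cpu))
```

The per-host branch is not part of the original advice; it is added here because busy disks concentrated on one machine suggest a shared component rather than the drives themselves.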

> >> > >>>>>>

> >> > >>>>>> Maged

> >> > >>>>>>

> >> > >>>>>> On 2017-10-18 18:13, Russell Glaue wrote:

> >> > >>>>>>

> >> > >>>>>> In my previous post, in one of my points I was wondering if the
> >> > >>>>>> request size would increase if I enabled jumbo packets. Currently it is
> >> > >>>>>> disabled.
> >> > >>>>>>
> >> > >>>>>> @jdillama: The qemu settings for both these two guest machines, with
> >> > >>>>>> RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing
> >> > >>>>>> the qemu settings of "min_io_size=<limited to 16bits>,opt_io_size=<RBD
> >> > >>>>>> image object size>" will directly address the issue.
> >> > >>>>>>
> >> > >>>>>> @mmokhtar: Ok. So you suggest the request size is the result of the
> >> > >>>>>> problem and not the cause of the problem, meaning I should go after a
> >> > >>>>>> different issue.

> >> > >>>>>>

> >> > >>>>>> I have been trying to get write speeds up to what people on this mail
> >> > >>>>>> list are discussing.
> >> > >>>>>> It seems that for our configuration, as it matches others, we should
> >> > >>>>>> be getting about 70MB/s write speed.

> >> > >>>>>> But we are not getting that.

> >> > >>>>>> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are

> >> > >>>>>> typically 1MB/s to 2MB/s.

> >> > >>>>>> Monitoring the entire Ceph cluster (using

> >> > >>>>>> http://cephdash.crapworks.de/), I have seen very rare momentary

> >> > >>>>>> spikes up to 30MB/s.

> >> > >>>>>>

> >> > >>>>>> My storage network is connected via a 10Gb switch
> >> > >>>>>> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller
> >> > >>>>>> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID)
> >> > >>>>>> Each drive is one LVM group, with two volumes - one volume for the
> >> > >>>>>> osd, one volume for the journal
> >> > >>>>>> Each osd is formatted with xfs
> >> > >>>>>> The crush map is simple: default->rack->[host[1..4]->osd] with an
> >> > >>>>>> evenly distributed weight
> >> > >>>>>> The redundancy is triple replication

> >> > >>>>>>

> >> > >>>>>> While I have read comments that having the osd and journal on the
> >> > >>>>>> same disk decreases write speed, I have also read that once past 8 OSDs per
> >> > >>>>>> node this is the recommended configuration. However, this is also the reason
> >> > >>>>>> why SSD drives are used exclusively for OSDs in the storage nodes.
> >> > >>>>>> Nonetheless, I was still expecting write speeds to be above 30MB/s,
> >> > >>>>>> not below 6MB/s.
> >> > >>>>>> Even at 12x slower than the RAID, using my previously posted iostat
> >> > >>>>>> data set, I should be seeing write speeds that average 10MB/s, not 2MB/s.

> >> > >>>>>>

> >> > >>>>>> In regards to the rados benchmark tests you asked me to run, here is
> >> > >>>>>> the output:

> >> > >>>>>>

> >> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1

> >> > >>>>>> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096
> >> > >>>>>> for up to 30 seconds or 0 objects
> >> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
> >> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >> > >>>>>>     0       0         0         0         0         0           -           0
> >> > >>>>>>     1       1       201       200   0.78356   0.78125  0.00522307  0.00496574
> >> > >>>>>>     2       1       469       468  0.915303   1.04688  0.00437497  0.00426141
> >> > >>>>>>     3       1       741       740  0.964371    1.0625  0.00512853   0.0040434
> >> > >>>>>>     4       1       888       887  0.866739  0.574219  0.00307699  0.00450177
> >> > >>>>>>     5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559
> >> > >>>>>>     6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525
> >> > >>>>>>     7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452
> >> > >>>>>>     8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458
> >> > >>>>>>     9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172
> >> > >>>>>>    10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152
> >> > >>>>>>    11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092
> >> > >>>>>>    12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491
> >> > >>>>>>    13       1      3016      3015  0.906025         1  0.00335703  0.00430691
> >> > >>>>>>    14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495
> >> > >>>>>>    15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387
> >> > >>>>>>    16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895
> >> > >>>>>>    17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614
> >> > >>>>>>    18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214
> >> > >>>>>>    19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442
> >> > >>>>>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
> >> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >> > >>>>>>    20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118
> >> > >>>>>>    21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937
> >> > >>>>>>    22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199
> >> > >>>>>>    23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836
> >> > >>>>>>    24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146
> >> > >>>>>>    25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881
> >> > >>>>>>    26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915
> >> > >>>>>>    27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118
> >> > >>>>>>    28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777
> >> > >>>>>>    29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758
> >> > >>>>>>    30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238

> >> > >>>>>> Total time run:         30.006060

> >> > >>>>>> Total writes made:      7272

> >> > >>>>>> Write size:             4096

> >> > >>>>>> Object size:            4096

> >> > >>>>>> Bandwidth (MB/sec):     0.946684

> >> > >>>>>> Stddev Bandwidth:       0.123762

> >> > >>>>>> Max bandwidth (MB/sec): 1.0625

> >> > >>>>>> Min bandwidth (MB/sec): 0.574219

> >> > >>>>>> Average IOPS:           242

> >> > >>>>>> Stddev IOPS:            31

> >> > >>>>>> Max IOPS:               272

> >> > >>>>>> Min IOPS:               147

> >> > >>>>>> Average Latency(s):     0.00412247

> >> > >>>>>> Stddev Latency(s):      0.00648437

> >> > >>>>>> Max latency(s):         0.270553

> >> > >>>>>> Min latency(s):         0.00175318

> >> > >>>>>> Cleaning up (deleting benchmark objects)

> >> > >>>>>> Clean up completed and total clean up time :29.069423

> >> > >>>>>>

> >> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32

> >> > >>>>>> Maintaining 32 concurrent writes of 4096 bytes to objects of size
> >> > >>>>>> 4096 for up to 30 seconds or 0 objects
> >> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
> >> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >> > >>>>>>     0       0         0         0         0         0           -           0
> >> > >>>>>>     1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026
> >> > >>>>>>     2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016
> >> > >>>>>>     3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726
> >> > >>>>>>     4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946
> >> > >>>>>>     5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577
> >> > >>>>>>     6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259
> >> > >>>>>>     7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418
> >> > >>>>>>     8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494
> >> > >>>>>>     9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567
> >> > >>>>>>    10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665
> >> > >>>>>>    11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708
> >> > >>>>>>    12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915
> >> > >>>>>>    13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849
> >> > >>>>>>    14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568
> >> > >>>>>>    15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545
> >> > >>>>>>    16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929
> >> > >>>>>>    17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887
> >> > >>>>>>    18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348
> >> > >>>>>>    19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775
> >> > >>>>>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
> >> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
> >> > >>>>>>    20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559
> >> > >>>>>>    21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288
> >> > >>>>>>    22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267
> >> > >>>>>>    23      32     23057     23025   3.90988         0           -   0.0299267
> >> > >>>>>>    24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804
> >> > >>>>>>    25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462
> >> > >>>>>>    26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129
> >> > >>>>>>    27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469
> >> > >>>>>>    28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911
> >> > >>>>>>    29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794
> >> > >>>>>>    30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411

> >> > >>>>>> Total time run:         30.822350

> >> > >>>>>> Total writes made:      31032

> >> > >>>>>> Write size:             4096

> >> > >>>>>> Object size:            4096

> >> > >>>>>> Bandwidth (MB/sec):     3.93282

> >> > >>>>>> Stddev Bandwidth:       3.66265

> >> > >>>>>> Max bandwidth (MB/sec): 13.668

> >> > >>>>>> Min bandwidth (MB/sec): 0

> >> > >>>>>> Average IOPS:           1006

> >> > >>>>>> Stddev IOPS:            937

> >> > >>>>>> Max IOPS:               3499

> >> > >>>>>> Min IOPS:               0

> >> > >>>>>> Average Latency(s):     0.0317779

> >> > >>>>>> Stddev Latency(s):      0.164076

> >> > >>>>>> Max latency(s):         2.27707

> >> > >>>>>> Min latency(s):         0.0013848

> >> > >>>>>> Cleaning up (deleting benchmark objects)

> >> > >>>>>> Clean up completed and total clean up time :20.166559
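
As a cross-check on the two summaries above, the reported bandwidth and average latency can be recomputed from the reported IOPS alone. This is a sketch, not part of the original test: the helper names are made up, "MB" is assumed to mean 2^20 bytes (which is what makes the numbers line up), and the latency estimate uses Little's law (requests in flight divided by throughput).

```python
MIB = 2 ** 20  # assumption: rados bench "MB" is 2**20 bytes


def bandwidth_mib_per_s(iops: float, block_bytes: int = 4096) -> float:
    """Bandwidth implied by a given IOPS figure at a fixed write size."""
    return iops * block_bytes / MIB


def latency_from_iops(concurrency: int, iops: float) -> float:
    """Little's law: average latency = requests in flight / throughput."""
    return concurrency / iops


# "Average IOPS" from the two runs above, keyed by the -t value.
runs = {1: 242, 32: 1006}
for threads, iops in runs.items():
    print(f"-t {threads}: {bandwidth_mib_per_s(iops):.2f} MB/s, "
          f"{latency_from_iops(threads, iops) * 1000:.1f} ms average latency")
```

Both estimates land within about 1% of the reported Bandwidth and Average Latency lines. Read this way, going from 1 to 32 threads raised IOPS only about 4x while average latency rose roughly 8x, so most of the added concurrency was spent waiting.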

> >> > >>>>>>

> >> > >>>>>>

> >> > >>>>>>

> >> > >>>>>>

> >> > >>>>>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx>
> >> > >>>>>> wrote:

> >> > >>>>>>

> >> > >>>>>>> First a general comment: local RAID will be faster than Ceph for a
> >> > >>>>>>> single threaded (queue depth=1) io operation test. A single thread Ceph
> >> > >>>>>>> client will see at best the same disk speed for reads, and for writes 4-6 times
> >> > >>>>>>> slower than a single disk. Not to mention the latency of local disks will
> >> > >>>>>>> be much better. Where Ceph shines is when you have many concurrent ios: it
> >> > >>>>>>> scales, whereas RAID will decrease speed per client as you add more.

> >> > >>>>>>>

> >> > >>>>>>> Having said that, I would recommend running rados/rbd bench-write
> >> > >>>>>>> and measuring 4k iops at 1 and 32 threads to get a better idea of how your
> >> > >>>>>>> cluster performs:

> >> > >>>>>>>

> >> > >>>>>>> ceph osd pool create testpool 256 256

> >> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 1

> >> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 32

> >> > >>>>>>> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
> >> > >>>>>>>
> >> > >>>>>>> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
> >> > >>>>>>> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false

> >> > >>>>>>>

> >> > >>>>>>> I think the request size difference you see is due to the io
> >> > >>>>>>> scheduler in the case of local disks having more ios to re-group, so it has a
> >> > >>>>>>> better chance of generating larger requests. Depending on your kernel, the
> >> > >>>>>>> io scheduler may be different for rbd (blk-mq) vs sdx (cfq) but again I
> >> > >>>>>>> would think the request size is a result not a cause.

> >> > >>>>>>>

> >> > >>>>>>> Maged

> >> > >>>>>>>

> >> > >>>>>>> On 2017-10-17 23:12, Russell Glaue wrote:

> >> > >>>>>>>

> >> > >>>>>>> I am running ceph jewel on 5 nodes with SSD OSDs.

> >> > >>>>>>> I have an LVM image on a local RAID of spinning disks.

> >> > >>>>>>> I have an RBD image in a pool of SSD disks.
> >> > >>>>>>> Both disks are used to run an almost identical CentOS 7 system.
> >> > >>>>>>> Both systems were installed with the same kickstart, though the disk
> >> > >>>>>>> partitioning is different.

> >> > >>>>>>>

> >> > >>>>>>> I want to make writes on the ceph image faster. For example,
> >> > >>>>>>> lots of writes to MySQL (via MySQL replication) on a ceph SSD image are
> >> > >>>>>>> about 10x slower than on a spindle RAID disk image. The MySQL server on
> >> > >>>>>>> the ceph rbd image has a hard time keeping up in replication.

> >> > >>>>>>>

> >> > >>>>>>> So I wanted to test writes on these two systems

> >> > >>>>>>> I have a 10GB compressed (gzip) file on both servers.

> >> > >>>>>>> I simply gunzip the file on both systems, while running iostat.

> >> > >>>>>>>

> >> > >>>>>>> The primary difference I see in the results is the average size of
> >> > >>>>>>> the request to the disk.
> >> > >>>>>>> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of
> >> > >>>>>>> the request is about 40x larger, but the number of writes per second is about the
> >> > >>>>>>> same.
> >> > >>>>>>> This makes me want to conclude that the smaller size of the request
> >> > >>>>>>> for the CentOS7-ceph-rbd-ssd system is the cause of it being slow.

> >> > >>>>>>>

> >> > >>>>>>>

> >> > >>>>>>> How can I make the size of the request larger for ceph rbd images,
> >> > >>>>>>> so I can increase the write throughput?
> >> > >>>>>>> Would this be related to having jumbo packets enabled in my ceph
> >> > >>>>>>> storage network?

> >> > >>>>>>>

> >> > >>>>>>>

> >> > >>>>>>> Here is a sample of the results:

> >> > >>>>>>>

> >> > >>>>>>> [CentOS7-lvm-raid-sata]

> >> > >>>>>>> $ gunzip large10gFile.gz &

> >> > >>>>>>> $ iostat -x vg_root-lv_var -d 5 -m -N

> >> > >>>>>>> Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >> > >>>>>>> ...
> >> > >>>>>>> vg_root-lv_var     0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
> >> > >>>>>>> vg_root-lv_var     0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
> >> > >>>>>>> vg_root-lv_var     0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
> >> > >>>>>>> vg_root-lv_var     0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
> >> > >>>>>>> vg_root-lv_var     0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16

> >> > >>>>>>> ...

> >> > >>>>>>>

> >> > >>>>>>> [CentOS7-ceph-rbd-ssd]

> >> > >>>>>>> $ gunzip large10gFile.gz &

> >> > >>>>>>> $ iostat -x vg_root-lv_data -d 5 -m -N

> >> > >>>>>>> Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >> > >>>>>>> ...
> >> > >>>>>>> vg_root-lv_data    0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
> >> > >>>>>>> vg_root-lv_data    0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
> >> > >>>>>>> vg_root-lv_data    0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
> >> > >>>>>>> vg_root-lv_data    0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
> >> > >>>>>>> vg_root-lv_data    0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38

> >> > >>>>>>> ...

> >> > >>>>>>>

> >> > >>>>>>> [iostat key]

> >> > >>>>>>> w/s == The number (after merges) of write requests completed per
> >> > >>>>>>> second for the device.
> >> > >>>>>>> wMB/s == The number of sectors (kilobytes, megabytes) written to the
> >> > >>>>>>> device per second.
> >> > >>>>>>> avgrq-sz == The average size (in sectors) of the requests that
> >> > >>>>>>> were issued to the device.
> >> > >>>>>>> avgqu-sz == The average queue length of the requests that were
> >> > >>>>>>> issued to the device.
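
The avgrq-sz column can be recomputed from the other columns of the same iostat row, which also shows what unit it is in. The sketch below uses the first data row of each sample above; the function name is made up, and it assumes iostat's MB means 2^20 bytes. The results match the reported column only when the size is expressed in 512-byte sectors.

```python
SECTOR_BYTES = 512  # unit that makes the recomputed value match avgrq-sz
MIB = 2 ** 20       # assumption: iostat -m reports MB as 2**20 bytes


def avg_request_sectors(r_s: float, w_s: float, rmb_s: float, wmb_s: float) -> float:
    """Average request size = bytes moved per second / requests per second."""
    bytes_per_s = (rmb_s + wmb_s) * MIB
    return bytes_per_s / (r_s + w_s) / SECTOR_BYTES


# First data row of each iostat sample: r/s, w/s, rMB/s, wMB/s.
lvm = avg_request_sectors(30.60, 452.20, 13.60, 222.15)
rbd = avg_request_sectors(46.40, 167.80, 0.88, 1.46)

for name, sectors in (("lvm-raid-sata", lvm), ("ceph-rbd-ssd", rbd)):
    kib = sectors * SECTOR_BYTES / 1024
    print(f"{name}: {sectors:.2f} sectors per request ({kib:.1f} KiB)")
print(f"ratio: {lvm / rbd:.1f}x")
```

The recomputed values agree with the 1000.04 and 22.36 shown in the tables, that is, roughly 500 KiB per request on the local RAID against roughly 11 KiB on the RBD image.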

> >> > >>>>>>>

> >> > >>>>>>>

> >> > >>>>>>> _______________________________________________

> >> > >>>>>>> ceph-users mailing list

> >> > >>>>>>> ceph-users@xxxxxxxxxxxxxx

> >> > >>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


> >> --

> >> Christian Balzer        Network/Systems Engineer

> >> chibi@xxxxxxx           Rakuten Communications


--

Christian Balzer        Network/Systems Engineer

chibi@xxxxxxx           Rakuten Communications
