Here are some random samples I recorded in the past 30 minutes.
 11 K blocks   10542 kB/s   909 op/s
 12 K blocks   15397 kB/s  1247 op/s
 26 K blocks   34306 kB/s  1307 op/s
 33 K blocks   48509 kB/s  1465 op/s
 59 K blocks   59333 kB/s   999 op/s
172 K blocks  101939 kB/s   590 op/s
104 K blocks   82605 kB/s   788 op/s
128 K blocks   77454 kB/s   601 op/s
136 K blocks   47526 kB/s   348 op/s
On Fri, Dec 8, 2017 at 2:04 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
4M block sizes you will only need 22.5 iops
On 2017-12-08 09:59, Maged Mokhtar wrote:
Hi Russell,
It is probably due to the difference in block sizes used in the test vs your cluster load. You have a latency problem which is limiting your max write iops to around 2.5K. For large block sizes you do not need that many iops: for example, if you write in 4M block sizes you will only need 22.5 iops to reach your bandwidth of 90 MB/s, and in such a case your latency problem will not affect your bandwidth. The reason I had suggested you run the original test in 4k size was that this was the original subject of this thread: the gunzip test and the small block sizes you were getting with iostat.
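The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the function name `required_iops` is not from the thread, and it assumes "4M" means 4 MB per write with no protocol overhead.

```python
def required_iops(bandwidth_mb_s: float, block_size_mb: float) -> float:
    """IOPS needed to sustain a given bandwidth at a fixed block size."""
    if block_size_mb <= 0:
        raise ValueError("block size must be positive")
    return bandwidth_mb_s / block_size_mb


# 90 MB/s written in 4 MB blocks needs very few operations per second.
print(required_iops(90, 4))          # 22.5
# The same 90 MB/s in 4 KB blocks (4/1024 MB) needs far more.
print(required_iops(90, 4 / 1024))   # 23040.0
```

With an iops ceiling of around 2.5K, the first case fits easily while the second does not, which is the point being made about block size.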
If you want to know a "rough" ballpark on what block sizes you currently see on your cluster, get the total bandwidth and iops as reported by ceph ( ceph status should give you this ) and divide the first by the second.
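As a rough sketch of that division, assuming you have copied a bandwidth figure in kB/s and an op/s figure from ceph status by hand (the helper name `avg_block_size_kb` is made up for this example):

```python
def avg_block_size_kb(bandwidth_kb_s: float, ops_per_s: float) -> float:
    """Average I/O size in KB, estimated as total bandwidth / total iops."""
    if ops_per_s <= 0:
        raise ValueError("ops_per_s must be positive")
    return bandwidth_kb_s / ops_per_s


# (kB/s, op/s) pairs taken from the samples recorded in this thread.
samples = [(10542, 909), (48509, 1465), (101939, 590)]
for bandwidth, ops in samples:
    size = avg_block_size_kb(bandwidth, ops)
    print(f"{bandwidth:>7} kB/s / {ops:>5} op/s = {size:6.1f} KB per op")
```

The result is only a ballpark: it averages reads and writes of all clients over the sampling interval.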
I still think you have a significant latency/iops issue: a cluster of 36 all-SSD OSDs should give much higher than 2.5K iops
Maged
On 2017-12-07 23:57, Russell Glaue wrote:
I want to provide an update to my interesting situation.
(New storage nodes were purchased and are going into the cluster soon)
I have been monitoring the ceph storage nodes with atop, and read/write throughput with ceph-dash, for the last month.
I am regularly seeing 80-90MB/s of write throughput (140MB/s read) on the ceph cluster. At these moments, the problem ceph node I have been speaking of shows 101% disk busy on the same 3 to 4 (of the 9) OSDs. So I am getting the throughput that I want on the cluster, despite the OSDs in question.
However, when I run the bench tests described in this thread, I do not see the write throughput go above 5MB/s.
When I take the problem node out, and run the bench tests, I see the throughput double, but not over 10MB/s.
Why is the ceph cluster getting up to 90MB/s write in the wild, but not when running the bench tests?
-RG
On Fri, Oct 27, 2017 at 4:21 PM, Russell Glaue <rglaue@xxxxxxxx> wrote:
Yes, several have recommended the fio test now.
I cannot perform a fio test at this time, because the post referred to directs us to write the fio test data directly to the disk device, e.g. /dev/sdj. I'd have to take an OSD completely out in order to perform the test, and I am not ready to do that at this time. Perhaps after I attempt the hardware firmware updates, and still do not have an answer, I would then take an OSD out of the cluster to run the fio test.
Also, our M500 disks on the two newest machines are all running version MU05, the latest firmware. On the older two, they are behind a RAID0, but I suspect they might be MU03 firmware.
-RG
On Fri, Oct 27, 2017 at 4:12 PM, Brian Andrus <brian.andrus@xxxxxxxxxxxxx> wrote:
I would be interested in seeing the results from the post mentioned by an earlier contributor:
Test an "old" M500 and a "new" M500 and see if the performance is A) acceptable and B) comparable. Find hardware revision or firmware revision in case of A=Good and B=different.
If the "old" device doesn't test well in fio/dd testing, then the drives are (as expected) not a great choice for journals and you might want to look at hardware/backplane/RAID configuration differences that are somehow allowing them to perform adequately.
On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <rglaue@xxxxxxxx> wrote:
Yes, all the M500s we use are both journal and OSD, even the older ones. We have a 3 year lifecycle and move older nodes from one ceph cluster to another.
On old systems with 3 year old M500s, they run as RAID0, and run faster than our current problem system with 1 year old M500s, run as non-RAID pass-through on the controller.
All disks are SATA and are connected to a SAS controller. We were wondering if the SAS/SATA conversion is an issue. Yet, the older systems don't exhibit a problem.
I found what I wanted to know from a colleague, that when the current ceph cluster was put together, the SSDs tested at 300+MB/s, and ceph cluster writes at 30MB/s.
Using SMART tools, the reserved cells in all drives is nearly 100%.
Restarting the OSDs slightly improved performance. Still betting on hardware issues that a firmware upgrade may resolve.
-RG
On Oct 27, 2017 1:14 PM, "Brian Andrus" <brian.andrus@xxxxxxxxxxxxx> wrote:
@Russel, are your "older Crucial M500"s being used as journals?
Crucial M500s are not to be used as a Ceph journal in my last experience with them. They make good OSDs with an NVMe in front of them perhaps, but not much else.
Ceph uses O_DSYNC for journal writes and these drives do not handle them as expected. It's been many years since I've dealt with the M500s specifically, but it has to do with the capacitor/power save feature and how it handles those types of writes. I'm sorry I don't have the emails with specifics around anymore, but last I remember, this was a hardware issue and could not be resolved with firmware.
Paging Kyle Bader...
On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <rglaue@xxxxxxxx> wrote:
We have older Crucial M500 disks operating without such problems. So, I have to believe it is a hardware firmware issue.
And it's peculiar seeing performance boost slightly, even 24 hours later, when I stop then start the OSDs.
Our actual writes are low, as most of our Ceph Cluster based images are low-write, high-memory. So a 20GB/day life/write capacity is a non-issue for us. Only write speed is the concern. Our write-intensive images are locked on non-ceph disks.
What are others using for SSD drives in their Ceph cluster?
With 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37 models seem to be the best for the price today.
On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
It is quite likely related, things are pointing to bad disks. Probably the best thing is to plan for disk replacement, the sooner the better as it could get worse.
On 2017-10-27 02:22, Christian Wuerdig wrote:
Hm, not necessarily directly related to your performance problem,
however: These SSDs have a listed endurance of 72TB total data written
- over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
that you run the journal for each OSD on the same disk, that's
effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
know many who'd run a cluster on disks like those. Also it means these
are pure consumer drives which have a habit of exhibiting random
performance at times (based on unquantified anecdotal personal
experience with other consumer model SSDs). I wouldn't touch these
with a long stick for anything but small toy-test clusters.
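The endurance estimate above can be reproduced with a short sketch. The 960 GB capacity is an assumption read off the model name Crucial_CT960M500SSD1, and the function name `endurance_per_day` is illustrative only.

```python
def endurance_per_day(endurance_tb: float, years: float, capacity_gb: float):
    """Return (GB written per day, drive writes per day) for a rated endurance."""
    gb_per_day = endurance_tb * 1000 / (years * 365)
    return gb_per_day, gb_per_day / capacity_gb


# 72 TB rated endurance spread over 5 years on an (assumed) 960 GB drive.
gb_per_day, dwpd = endurance_per_day(72, 5, 960)
print(f"{gb_per_day:.1f} GB/day, {dwpd:.3f} DWPD")

# With the journal on the same disk every client write lands twice,
# so the budget for client data is roughly half of that.
print(f"{gb_per_day / 2:.1f} GB/day, {dwpd / 2:.3f} DWPD with colocated journal")
```

This ignores write amplification inside the SSD, so the real budget may be lower still.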
On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <rglaue@xxxxxxxx> wrote:
On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
It depends on what stage you are in:
in production, probably the best thing is to set up a monitoring tool
(collectd/graphite/prometheus/grafana) to monitor both ceph stats as well as
resource load. This will, among other things, show you if you have slowing
disks.
I am monitoring Ceph performance with ceph-dash
(http://cephdash.crapworks.de/), that is why I knew to look into the slow
writes issue. And I am using Monitorix (http://www.monitorix.org/) to
monitor system resources, including Disk I/O.
However, though I can monitor individual disk performance at the system
level, it seems Ceph does not tax any disk more than the worst disk. So in
my monitoring charts, all disks have the same performance.
All four nodes are base-lining at 50 writes/sec during the cluster's normal
load, with the non-problem hosts spiking up to 150, and the problem host
only spikes up to 100.
But during the window of time I took the problem host OSDs down to run the
bench tests, the OSDs on the other nodes increased to 300-500 writes/sec.
Otherwise, the chart looks the same for all disks on all ceph nodes/hosts.
Before production you should first make sure your SSDs are suitable for
Ceph, either by being recommended by other Ceph users or by testing them
yourself for sync write performance using the fio tool as outlined earlier.
Then after you build your cluster you can use rados and/or rbd benchmark
tests to benchmark your cluster and find bottlenecks using atop/sar/collectl
which will help you tune your cluster.
All 36 OSDs are: Crucial_CT960M500SSD1
Rados bench tests were done at the beginning. The speed was much faster than
it is now. I cannot recall the test results, someone else on my team ran
them. Recently, I had thought the slow disk problem was a configuration
issue with Ceph - before I posted here. Now we are hoping it may be resolved
with a firmware update. (If it is firmware related, rebooting the problem
node may temporarily resolve this)
Though you did see better improvements, your cluster with 27 SSDs should
give much higher numbers than 3k iops. If you are running rados bench while
you have other client ios, then obviously the reported number by the tool
will be less than what the cluster is actually giving...which you can find
out via ceph status command, it will print the total cluster throughput and
iops. If the total is still low i would recommend running the fio raw disk
test, maybe the disks are not suitable. When you removed your 9 bad disks
from 36 and your performance doubled, you still had 2 other disks slowing
you, meaning near 100% busy? It makes me feel the disk type used is not
good. For these near 100% busy disks can you also measure their raw disk
iops at that load (I am not sure atop shows this, if not use
sar/sysstat/iostat/collectl).
I ran another bench test today with all 36 OSDs up. The overall performance
was improved slightly compared to the original tests. Only 3 OSDs on the
problem host were increasing to 101% disk busy.
The iops reported from ceph status during this bench test ranged from 1.6k
to 3.3k, the test yielding 4k iops.
Yes, the two other OSDs/disks that were the bottleneck were at 101% disk
busy. The other OSD disks on the same host were sailing along at like 50-60%
busy.
All 36 OSD disks are exactly the same disk. They were all purchased at the
same time. All were installed at the same time.
I cannot believe it is a problem with the disk model. A failed/bad disk,
perhaps is possible. But the disk model itself cannot be the problem based
on what I am seeing. If I am seeing bad performance on all disks on one ceph
node/host, but not on another ceph node with these same disks, it has to be
some other factor. This is why I am now guessing a firmware upgrade is
needed.
Also, as I alluded to here earlier, I took down all 9 OSDs in the problem
host yesterday to run the bench test.
Today, with those 9 OSDs back online, I reran the bench test and I am seeing 2-3
OSD disks with 101% busy on the problem host, and the other disks are lower
than 80%. So, for whatever reason, shutting down the OSDs and starting them
back up allowed many (not all) of the OSDs' performance to improve on the
problem host.
Maged
On 2017-10-25 23:44, Russell Glaue wrote:
Thanks to all.
I took the OSDs down in the problem host, without shutting down the
machine.
As predicted, our MB/s about doubled.
Using this bench/atop procedure, I found two other OSDs on another host
that are the next bottlenecks.
Is this the only good way to really test the performance of the drives as
OSDs? Is there any other way?
While running the bench on all 36 OSDs, the 9 problem OSDs stuck out. But
two new problem OSDs I just discovered in this recent test of 27 OSDs did
not stick out at all, because ceph bench distributes the load, making only
the very worst denominators show up in atop. So ceph is as slow as your
slowest drive.
It would be really great if I could run the bench test, and somehow get
the bench to use only certain OSDs during the test. Then I could run the
test, avoiding the OSDs that I already know are a problem, so I can find the
next worst OSD.
[ the bench test ]
rados bench -p scbench -b 4096 30 write -t 32
[ original results with all 36 OSDs ]
Total time run: 30.822350
Total writes made: 31032
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 3.93282
Stddev Bandwidth: 3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS: 1006
Stddev IOPS: 937
Max IOPS: 3499
Min IOPS: 0
Average Latency(s): 0.0317779
Stddev Latency(s): 0.164076
Max latency(s): 2.27707
Min latency(s): 0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :20.166559
[ after stopping all of the OSDs (9) on the problem host ]
Total time run: 32.586830
Total writes made: 59491
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 7.13131
Stddev Bandwidth: 9.78725
Max bandwidth (MB/sec): 29.168
Min bandwidth (MB/sec): 0
Average IOPS: 1825
Stddev IOPS: 2505
Max IOPS: 7467
Min IOPS: 0
Average Latency(s): 0.0173691
Stddev Latency(s): 0.21634
Max latency(s): 6.71283
Min latency(s): 0.00107473
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :16.269393
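One way to read these two result sets, offered as an interpretation rather than something stated in the thread: with a fixed number of writes in flight (-t 32), the average IOPS is roughly the concurrency divided by the average latency (Little's law). The sketch below checks that against the figures reported above; the names are illustrative.

```python
def expected_iops(concurrency: int, avg_latency_s: float) -> float:
    """Little's law: throughput = requests in flight / average latency."""
    return concurrency / avg_latency_s


# (label, threads, reported average latency in seconds, reported average IOPS)
runs = [
    ("all 36 OSDs", 32, 0.0317779, 1006),
    ("problem host stopped", 32, 0.0173691, 1825),
]
for label, threads, latency, reported in runs:
    estimate = expected_iops(threads, latency)
    print(f"{label}: estimated {estimate:.0f} iops, reported {reported}")
```

Both estimates land within about 1% of the reported averages, which suggests the benchmark is bounded by per-write latency rather than by raw bandwidth.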
On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <rglaue@xxxxxxxx> wrote:
On the machine in question, the 2nd newest, we are using the LSI MegaRAID
SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery.
The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported
earlier, each single drive configured as RAID0.
Thanks for everyone's help.
I am going to run a 32 thread bench test after taking the 2nd machine out
of the cluster with noout.
After it is out of the cluster, I am expecting the slow write issue will
not surface.
On Fri, Oct 20, 2017 at 5:27 AM, David Turner <drakonstein@xxxxxxxxx>
wrote:
I can attest that the battery in the raid controller is a thing. I'm
used to using lsi controllers, but my current position has hp raid
controllers and we just tracked down that the 10 of our nodes that had
>100ms await pretty much always were the only 10 nodes in the cluster
with failed batteries on the raid controllers.
On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
That is a good idea.
However, a previous rebalancing process has brought performance of our
Guest VMs to a slow drag.
Never mind that I'm not sure that these SSDs are particularly well suited
for Ceph, your problem is clearly located on that one node.
Not that I think it's the case, but make sure your PG distribution is
not
skewed with many more PGs per OSD on that node.
Once you rule that out my first guess is the RAID controller, you're
running the SSDs as single RAID0s I presume?
If so, either a configuration difference or a failed BBU on the controller
could result in the writeback cache being disabled, which would explain
things beautifully.
As for a temporary test/fix (with reduced redundancy of course), set
noout
(or mon_osd_down_out_subtree_limit accordingly) and turn the slow host
off.
This should result in much better performance than you have now and of
course be the final confirmation of that host being the culprit.
Christian
On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez
<jelopez@xxxxxxxxxx>
wrote:
Hi Russell,
as you have 4 servers, assuming you are not doing EC pools, just
stop all
the OSDs on the second questionable server, mark the OSDs on that
server as
out, let the cluster rebalance and when all PGs are active+clean
just
replay the test.
All IOs should then go only to the other 3 servers.
JC
On Oct 19, 2017, at 13:49, Russell Glaue <rglaue@xxxxxxxx> wrote:
No, I have not ruled out the disk controller and backplane making
the
disks slower.
Is there a way I could test that theory, other than swapping out
hardware?
-RG
On Thu, Oct 19, 2017 at 3:44 PM, David Turner
<drakonstein@xxxxxxxxx>
wrote:
Have you ruled out the disk controller and backplane in the server
running slower?
On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <rglaue@xxxxxxxx>
wrote:
I ran the test on the Ceph pool, and ran atop on all 4 storage
servers,
as suggested.
Out of the 4 servers:
3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
Momentarily spiking up to 50% on one server, and 80% on another
The 2nd newest server was almost averaging 90% disk %busy and
150% CPU
wait. And more than momentarily spiking to 101% disk busy and
250% CPU wait.
For this 2nd newest server, this was the statistics for about 8
of 9
disks, with the 9th disk not far behind the others.
I cannot believe all 9 disks are bad
They are the same disks as the newest 1st server,
Crucial_CT960M500SSD1,
and same exact server hardware too.
They were purchased at the same time in the same purchase order
and
arrived at the same time.
So I cannot believe I just happened to put 9 bad disks in one
server,
and 9 good ones in the other.
I know I have Ceph configured exactly the same on all servers
And I am sure I have the hardware settings configured exactly the
same
on the 1st and 2nd servers.
So if I were someone else, I would say it maybe is bad hardware
on the
2nd server.
But the 2nd server is running very well without any hint of a
problem.
Any other ideas or suggestions?
-RG
On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar
<mmokhtar@xxxxxxxxxxx>
wrote:
just run the same 32 threaded rados test as you did before and
this
time run atop while the test is running looking for %busy of
cpu/disks. It
should give an idea if there is a bottleneck in them.
On 2017-10-18 21:35, Russell Glaue wrote:
I cannot run the write test reviewed at the
ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The
tests write directly to the raw disk device.
Reading an infile (created with urandom) on one SSD, writing the
outfile to another osd, yields about 17MB/s.
But isn't this write speed limited by the speed at which the dd
infile can be read?
And I assume the best test should be run with no other load.
How does one run the rados bench "as stress"?
-RG
On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar
<mmokhtar@xxxxxxxxxxx>
wrote:
measuring resource load as outlined earlier will show if the
drives
are performing well or not. Also how many osds do you have ?
On 2017-10-18 19:26, Russell Glaue wrote:
The SSD drives are Crucial M500
A Ceph user did some benchmarks and found it had good
performance
https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
However, a user comment from 3 years ago on the blog post you
linked
to says to avoid the Crucial M500
Yet, this performance posting tells that the Crucial M500 is
good.
https://inside.servers.com/ssd-performance-2017-c4307a92dea
On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar
<mmokhtar@xxxxxxxxxxx>
wrote:
Check out the following link: some SSDs perform badly in Ceph due to
sync writes to journal
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Another thing that can help is to re-run the rados 32 threads as
stress and view resource usage using atop (or collectl/sar) to check for
%busy cpu and %busy disks to give you an idea of what is holding down your
cluster. For example: if cpu/disk % are all low then check your
network/switches. If disk %busy is high (90%) for all disks then your
disks are the bottleneck: which either means you have SSDs that are not
suitable for Ceph or you have too few disks (which I doubt is the case). If
only 1 disk %busy is high, there may be something wrong with this disk
and it should be removed.
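The rule of thumb above could be sketched as a small triage function. Everything here is an assumption for illustration: the 90% "high" threshold comes from the text, the 50% "low" threshold is made up, and the readings are expected to be typed in from atop rather than collected automatically.

```python
def diagnose(cpu_busy: float, disk_busy: dict, high: float = 90.0, low: float = 50.0) -> str:
    """Classify a bottleneck from %busy readings taken during a bench run."""
    busy_disks = sorted(name for name, pct in disk_busy.items() if pct >= high)
    if busy_disks and len(busy_disks) == len(disk_busy):
        # Every disk is saturated: unsuitable SSDs or too few disks.
        return "disks are the bottleneck"
    if busy_disks:
        # Only some disks are saturated: look at those first.
        return "suspect disks: " + ", ".join(busy_disks)
    if cpu_busy < low and all(pct < low for pct in disk_busy.values()):
        return "cpu and disks are mostly idle: check network/switches"
    return "no clear bottleneck from %busy alone"


# Hypothetical readings for one host during the 32-thread test.
print(diagnose(20.0, {"sdb": 35.0, "sdc": 101.0, "sdd": 40.0}))
```

Running this per host makes it easier to see whether saturation is spread evenly or concentrated on a few devices.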
Maged
On 2017-10-18 18:13, Russell Glaue wrote:
In my previous post, in one of my points I was wondering if
the
request size would increase if I enabled jumbo packets.
currently it is
disabled.
@jdillama: The qemu settings for both these two guest
machines, with
RAID/LVM and Ceph/rbd images, are the same. I am not thinking
that changing
the qemu settings of "min_io_size=<limited to
16bits>,opt_io_size=<RBD
image object size>" will directly address the issue.
@mmokhtar: Ok. So you suggest the request size is the result
of the
problem and not the cause of the problem. meaning I should go
after a
different issue.
I have been trying to get write speeds up to what people on
this mail
list are discussing.
It seems that for our configuration, as it matches others, we
should
be getting about 70MB/s write speed.
But we are not getting that.
Single writes to disk are lucky to get 5MB/s to 6MB/s, but are
typically 1MB/s to 2MB/s.
Monitoring the entire Ceph cluster (using
http://cephdash.crapworks.de/), I have seen very rare
momentary
spikes up to 30MB/s.
My storage network is connected via a 10Gb switch
I have 4 storage servers with a LSI Logic MegaRAID SAS 2208
controller
Each storage server has 9 1TB SSD drives, each drive as 1 osd
(no
RAID)
Each drive is one LVM group, with two volumes - one volume for
the
osd, one volume for the journal
Each osd is formatted with xfs
The crush map is simple: default->rack->[host[1..4]->osd] with
an
evenly distributed weight
The redundancy is triple replication
While I have read comments that having the osd and journal on
the
same disk decreases write speed, I have also read that once
past 8 OSDs per
node this is the recommended configuration, however this is
also the reason
why SSD drives are used exclusively for OSDs in the storage
nodes.
None-the-less, I was still expecting write speeds to be above
30MB/s,
not below 6MB/s.
Even at 12x slower than the RAID, using my previously posted
iostat
data set, I should be seeing write speeds that average 10MB/s,
not 2MB/s.
In regards to the rados benchmark tests you asked me to run,
here is
the output:
[centos7]# rados bench -p scbench -b 4096 30 write -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1       1       201       200   0.78356   0.78125   0.00522307  0.00496574
    2       1       469       468  0.915303   1.04688   0.00437497  0.00426141
    3       1       741       740  0.964371    1.0625   0.00512853   0.0040434
    4       1       888       887  0.866739  0.574219   0.00307699  0.00450177
    5       1      1147      1146  0.895725   1.01172   0.00376454   0.0043559
    6       1      1325      1324  0.862293  0.695312   0.00459443    0.004525
    7       1      1494      1493   0.83339  0.660156   0.00461002  0.00458452
    8       1      1736      1735  0.847369  0.945312   0.00253971  0.00460458
    9       1      1998      1997  0.866922   1.02344   0.00236573  0.00450172
   10       1      2260      2259  0.882563   1.02344   0.00262179  0.00442152
   11       1      2526      2525  0.896775   1.03906   0.00336914  0.00435092
   12       1      2760      2759  0.898203  0.914062   0.00351827  0.00434491
   13       1      3016      3015  0.906025         1   0.00335703  0.00430691
   14       1      3257      3256  0.908545  0.941406   0.00332344  0.00429495
   15       1      3490      3489  0.908644  0.910156   0.00318815  0.00426387
   16       1      3728      3727  0.909952  0.929688    0.0032881  0.00428895
   17       1      3986      3985  0.915703   1.00781   0.00274809   0.0042614
   18       1      4250      4249  0.922116   1.03125   0.00287411  0.00423214
   19       1      4505      4504  0.926003  0.996094   0.00375435  0.00421442
2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20       1      4757      4756  0.928915  0.984375   0.00463972  0.00420118
   21       1      5009      5008   0.93155  0.984375   0.00360065  0.00418937
   22       1      5235      5234  0.929329  0.882812   0.00626214    0.004199
   23       1      5500      5499  0.933925   1.03516   0.00466584  0.00417836
   24       1      5708      5707  0.928861    0.8125   0.00285727  0.00420146
   25       0      5964      5964  0.931858   1.00391   0.00417383   0.0041881
   26       1      6216      6215  0.933722  0.980469    0.0041009  0.00417915
   27       1      6481      6480  0.937474   1.03516   0.00307484  0.00416118
   28       1      6745      6744  0.940819   1.03125   0.00266329  0.00414777
   29       1      7003      7002  0.943124   1.00781   0.00305905  0.00413758
   30       1      7271      7270  0.946578   1.04688   0.00391017  0.00412238
Total time run: 30.006060
Total writes made: 7272
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 0.946684
Stddev Bandwidth: 0.123762
Max bandwidth (MB/sec): 1.0625
Min bandwidth (MB/sec): 0.574219
Average IOPS: 242
Stddev IOPS: 31
Max IOPS: 272
Min IOPS: 147
Average Latency(s): 0.00412247
Stddev Latency(s): 0.00648437
Max latency(s): 0.270553
Min latency(s): 0.00175318
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :29.069423
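As a sanity check on the summary above, the reported bandwidth can be recomputed from the total writes, the write size and the run time. This is a sketch with an illustrative function name; the only assumption is that rados bench counts a MB as 1024*1024 bytes, which is what makes the figures agree.

```python
def bench_bandwidth_mb_s(total_writes: int, write_size_bytes: int, seconds: float) -> float:
    """Bandwidth in MB/s (1 MB = 1024*1024 bytes) for a completed bench run."""
    return total_writes * write_size_bytes / seconds / (1024 * 1024)


# Single-threaded run: 7272 writes of 4096 bytes in 30.006060 seconds.
print(f"{bench_bandwidth_mb_s(7272, 4096, 30.006060):.6f} MB/s")
```

At 4096 bytes per write, every 256 writes per second add up to only 1 MB/s, so a small-block benchmark will always report a low MB/s figure even when the iops are reasonable.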
[centos7]# rados bench -p scbench -b 4096 30 write -t 32
Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      32      3013      2981   11.6438   11.6445   0.00247906  0.00572026
    2      32      5349      5317   10.3834     9.125   0.00246662  0.00932016
    3      32      5707      5675    7.3883   1.39844   0.00389774   0.0156726
    4      32      5895      5863   5.72481  0.734375      1.13137   0.0167946
    5      32      6869      6837   5.34068   3.80469    0.0027652   0.0226577
    6      32      8901      8869   5.77306    7.9375    0.0053211   0.0216259
    7      32     10800     10768   6.00785   7.41797   0.00358187   0.0207418
    8      32     11825     11793   5.75728   4.00391   0.00217575   0.0215494
    9      32     12941     12909    5.6019   4.35938   0.00278512   0.0220567
   10      32     13317     13285   5.18849   1.46875    0.0034973   0.0240665
   11      32     16189     16157   5.73653   11.2188   0.00255841   0.0212708
   12      32     16749     16717   5.44077    2.1875   0.00330334   0.0215915
   13      32     16756     16724   5.02436 0.0273438   0.00338994    0.021849
   14      32     17908     17876   4.98686       4.5   0.00402598   0.0244568
   15      32     17936     17904   4.66171  0.109375   0.00375799   0.0245545
   16      32     18279     18247   4.45409   1.33984   0.00483873   0.0267929
   17      32     18372     18340   4.21346  0.363281   0.00505187   0.0275887
   18      32     19403     19371   4.20309   4.02734   0.00545154    0.029348
   19      31     19845     19814   4.07295   1.73047   0.00254726   0.0306775
2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
   20      31     20401     20370   3.97788   2.17188   0.00307238   0.0307559
   21      32     21338     21306   3.96254   3.65625   0.00464563   0.0312288
   22      32     23057     23025    4.0876   6.71484   0.00296295   0.0299267
   23      32     23057     23025   3.90988         0            -   0.0299267
   24      32     23803     23771   3.86837   1.45703   0.00301471   0.0312804
   25      32     24112     24080   3.76191   1.20703   0.00191063   0.0331462
   26      31     25303     25272   3.79629   4.65625   0.00794399   0.0329129
   27      32     28803     28771   4.16183    13.668    0.0109817   0.0297469
   28      32     29592     29560   4.12325   3.08203   0.00188185   0.0301911
   29      32     30595     30563   4.11616   3.91797   0.00379099   0.0296794
   30      32     31031     30999   4.03572   1.70312   0.00283347   0.0302411
Total time run: 30.822350
Total writes made: 31032
Write size: 4096
Object size: 4096
Bandwidth (MB/sec): 3.93282
Stddev Bandwidth: 3.66265
Max bandwidth (MB/sec): 13.668
Min bandwidth (MB/sec): 0
Average IOPS: 1006
Stddev IOPS: 937
Max IOPS: 3499
Min IOPS: 0
Average Latency(s): 0.0317779
Stddev Latency(s): 0.164076
Max latency(s): 2.27707
Min latency(s): 0.0013848
Cleaning up (deleting benchmark objects)
Clean up completed and total clean up time :20.166559
On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar
<mmokhtar@xxxxxxxxxxx>
wrote:
First a general comment: local RAID will be faster than Ceph
for a
single threaded (queue depth=1) io operation test. A single
thread Ceph
client will see at best same disk speed for reads and for
writes 4-6 times
slower than single disk. Not to mention the latency of local
disks will be
much better. Where Ceph shines is when you have many
concurrent ios, it
scales whereas RAID will decrease speed per client as you add
more.
Having said that, i would recommend running rados/rbd
bench-write
and measure 4k iops at 1 and 32 threads to get a better idea
of how your
cluster performs:
ceph osd pool create testpool 256 256
rados bench -p testpool -b 4096 30 write -t 1
rados bench -p testpool -b 4096 30 write -t 32
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
I think the request size difference you see is due to the io
scheduler in the case of local disks having more ios to
re-group so has a
better chance in generating larger requests. Depending on
your kernel, the
io scheduler may be different for rbd (blk-mq) vs sdx (cfq)
but again i
would think the request size is a result not a cause.
Maged
On 2017-10-17 23:12, Russell Glaue wrote:
I am running ceph jewel on 5 nodes with SSD OSDs.
I have an LVM image on a local RAID of spinning disks.
I have an RBD image in a pool of SSD disks.
Both disks are used to run an almost identical CentOS 7
system.
Both systems were installed with the same kickstart, though
the disk
partitioning is different.
I want to make writes on the ceph image faster. For
example,
lots of writes to MySQL (via MySQL replication) on a ceph SSD
image are
about 10x slower than on a spindle RAID disk image. The MySQL
server on
ceph rbd image has a hard time keeping up in replication.
So I wanted to test writes on these two systems
I have a 10GB compressed (gzip) file on both servers.
I simply gunzip the file on both systems, while running
iostat.
The primary difference I see in the results is the average
size of
the request to the disk.
CentOS7-lvm-raid-sata writes a lot faster to disk, and the
size of
the request is about 40x, but the number of writes per second
is about the
same
This makes me want to conclude that the smaller size of the
request
for CentOS7-ceph-rbd-ssd system is the cause of it being
slow.
How can I make the size of the request larger for ceph rbd
images,
so I can increase the write throughput?
Would this be related to having jumbo packets enabled in my
ceph
storage network?
Here is a sample of the results:
[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_var     0.00    0.00   30.60  452.20   13.60  222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
vg_root-lv_var     0.00    0.00   88.20  182.00   39.20   89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
vg_root-lv_var     0.00    0.00   75.45  278.24   33.53  136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
vg_root-lv_var     0.00    0.00  111.60  181.80   49.60   89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
vg_root-lv_var     0.00    0.00   68.40  109.60   30.40   53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
...
[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device:          rrqm/s  wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_data    0.00    0.00   46.40  167.80    0.88    1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
vg_root-lv_data    0.00    0.00   16.60   55.20    0.36    0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
vg_root-lv_data    0.00    0.00   69.00  173.80    1.34    1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
vg_root-lv_data    0.00    0.00   74.40  293.40    1.37    1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
vg_root-lv_data    0.00    0.00   90.80  359.00    1.96    3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
...
[iostat key]
w/s == The number (after merges) of write requests completed per
second for the device.
wMB/s == The number of sectors (kilobytes, megabytes) written to the
device per second.
avgrq-sz == The average size (in sectors) of the requests that
were issued to the device.
avgqu-sz == The average queue length of the requests that were
issued to the device.
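The avgrq-sz column can be cross-checked against the other columns. The sketch below recomputes it from r/s, w/s, rMB/s and wMB/s for the first row of each sample; the result only matches the reported column when the unit is taken as 512-byte sectors. Function names are illustrative.

```python
SECTOR_BYTES = 512


def avgrq_sz_sectors(r_s: float, w_s: float, r_mb_s: float, w_mb_s: float) -> float:
    """Average request size in sectors: bytes moved per second / requests per second."""
    bytes_per_s = (r_mb_s + w_mb_s) * 1024 * 1024
    return bytes_per_s / (r_s + w_s) / SECTOR_BYTES


def sectors_to_kb(sectors: float) -> float:
    """Convert a size in 512-byte sectors to kilobytes."""
    return sectors * SECTOR_BYTES / 1024


raid = avgrq_sz_sectors(30.60, 452.20, 13.60, 222.15)   # reported: 1000.04
ceph = avgrq_sz_sectors(46.40, 167.80, 0.88, 1.46)      # reported: 22.36
print(f"lvm-raid: {raid:.1f} sectors = {sectors_to_kb(raid):.0f} KB per request")
print(f"ceph-rbd: {ceph:.1f} sectors = {sectors_to_kb(ceph):.1f} KB per request")
```

So the local RAID volume sees requests of about 500 KB while the rbd-backed volume sees about 11 KB, at a comparable number of requests per second.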
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Christian Balzer Network/Systems Engineer
chibi@xxxxxxx Rakuten Communications
--
Brian Andrus | Cloud Systems Engineer | DreamHost
brian.andrus@xxxxxxxxxxxxx | www.dreamhost.com