Hi,

I changed the setup to use only the infiniband network. For the 4KB writes, the IOPS didn't improve much. I also logged into the OSD nodes, and atop showed the disks are not always 100% busy. Please see a snapshot of one node below:
DSK | sdc       | busy 72% | read 20/s  | write 86/s  | KiB/w 13 | MBr/s 0.16 | MBw/s 1.12 | avio 6.69 ms |
DSK | sda       | busy 47% | read 0/s   | write 589/s | KiB/w 4  | MBr/s 0.00 | MBw/s 2.83 | avio 0.79 ms |
DSK | sdb       | busy 31% | read 14/s  | write 77/s  | KiB/w 10 | MBr/s 0.11 | MBw/s 0.76 | avio 3.42 ms |
DSK | sdd       | busy 19% | read 4/s   | write 50/s  | KiB/w 11 | MBr/s 0.03 | MBw/s 0.55 | avio 3.40 ms |
NET | transport | tcpi 656/s | tcpo 655/s | udpi 0/s | udpo 0/s | tcpao 0/s | tcppo 0/s | tcprs 0/s |
NET | network   | ipi 657/s  | ipo 655/s  | ipfrw 0/s | deliv 657/s | icmpi 0/s | icmpo 0/s |
NET | p10p1 0%  | pcki 0/s   | pcko 0/s   | si 0 Kbps    | so 1 Kbps    | erri 0/s | erro 0/s | drpo 0/s |
NET | ib0 ----  | pcki 637/s | pcko 636/s | si 8006 Kbps | so 5213 Kbps | erri 0/s | erro 0/s | drpo 0/s |
NET | lo ----   | pcki 19/s  | pcko 19/s  | si 14 Kbps   | so 14 Kbps   | erri 0/s | erro 0/s | drpo 0/s |

/dev/sda is the OS and journaling SSD. The other three are OSDs.
Am I missing anything?
Thanks,
Zhang, Di
Postdoctoral Associate
Baylor College of Medicine
On Jul 13, 2016, at 6:56 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,

On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:

I also tried 4K write bench. The IOPS is ~420.

That's what people usually mean (4KB blocks) when talking about IOPS. This number is pretty low; my guess would be network latency on your 1Gb network for the most part.

You should run atop on your storage nodes while running a test like this and see if the OSDs (HDDs) are also very busy.

Lastly, rados bench gives you some basic numbers, but it is not the same as real client I/O. For that you want to run fio inside a VM, or in your case on a mounted CephFS.
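Something along these lines would be a starting point for the 4KB random write case (a sketch only: the /mnt/cephfs path and the size/job/depth values are assumptions to adjust for your setup):

fio --name=4ktest --directory=/mnt/cephfs/fiotest --ioengine=libaio \
    --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=4 \
    --iodepth=32 --runtime=60 --time_based --group_reporting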
I used to have better bandwidth when I used the same network for both the cluster and the clients. Now the bandwidth must be limited by the 1G ethernet.

That's the bandwidth you also see in your 4MB block tests below. For small I/Os the real killer is latency, though.
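A quick way to put a number on that latency is to compare round-trip times over both links (the addresses are placeholders; intervals below 0.2s need root):

ping -c 1000 -i 0.01 <osd-node-1G-address>
ping -c 1000 -i 0.01 <osd-node-IB-address>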
What would you suggest I do?

That depends mostly on your budget (switch ports, client NICs). A uniform, single 10Gb/s network would be better in all aspects than the split network you have now.

Christian

Thanks,
On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hello,

Sorry for the misunderstanding about IOPS. Here are some summary stats from my benchmark (does 20-30 IOPS seem normal to you?):
ceph osd pool create test 512 512
rados bench -p test 10 write --no-cleanup
Total time run:         10.480383
Total writes made:      288
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     109.92
Stddev Bandwidth:       11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:           27
Stddev IOPS:            3
Max IOPS:               31
Min IOPS:               20
Average Latency(s):     0.579105
Stddev Latency(s):      0.19902
Max latency(s):         1.32831
Min latency(s):         0.245505
rados bench -p test 10 seq

Total time run:       10.340724
Total reads made:     288
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.404
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             31
Min IOPS:             22
Average Latency(s):   0.564858
Max latency(s):       1.65278
Min latency(s):       0.141504
rados bench -p test 10 rand

Total time run:       10.546251
Total reads made:     293
Read size:            4194304
Object size:          4194304
Bandwidth (MB/sec):   111.13
Average IOPS:         27
Stddev IOPS:          2
Max IOPS:             32
Min IOPS:             24
Average Latency(s):   0.57092
Max latency(s):       1.8631
Min latency(s):       0.161936
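For the 4KB write bench mentioned earlier, the same command takes a block size option, something like:

rados bench -p test 10 write -b 4096 -t 16 --no-cleanup

(-b sets the write size, -t the number of concurrent operations; 16 is the default.)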
On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
I am using 10G infiniband for the cluster network and 1G ethernet for the public network.

Hmm, very unbalanced, but I guess that's HW you already had.

Because I don't have enough slots on the node, I am using three files on the OS drive (SSD) for journaling, which really improved but did not entirely solve the problem.
If you can, use partitions instead of files; there is less overhead. What model SSD is that?
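Migrating a journal from a file to a partition goes roughly like this (an untested sketch; do one OSD at a time, and the OSD id 0 and partition /dev/sda5 are assumptions to substitute with your own):

systemctl stop ceph-osd@0
ceph-osd -i 0 --flush-journal
rm /var/lib/ceph/osd/ceph-0/journal
ln -s /dev/disk/by-partuuid/<partuuid-of-sda5> /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
systemctl start ceph-osd@0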
Also putting the meta-data pool on SSDs might help.
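That means giving the SSDs their own CRUSH root and rule, then pointing the pool at that rule, e.g. (the rule id here stands in for whatever your SSD rule ends up as):

ceph osd pool set cephfs_metadata crush_ruleset <ssd-rule-id>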
I am quite happy with the current IOPS, which range from 200 MB/s to 400 MB/s sequential write, depending on the block size.
That's not IOPS, that's bandwidth, throughput.
But the problem is, when I transfer data to the cephfs at a rate below 100MB/s, I can observe the slow/blocked requests warnings after a few minutes via "ceph -w".

I doubt the speed has anything to do with this; the actual block sizes and IOPS numbers are the more likely culprits.
As always, watch your storage nodes with atop (or iostat) during such scenarios/tests and spot the bottlenecks.
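For example, on each storage node while the test runs (a hypothetical invocation; %util and await are the columns to watch):

iostat -xmt 1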
It's not specific to any particular OSDs, so I started to wonder whether the configuration is correct, or whether upgrading to Jewel can solve it.
Jewel is likely to help in general, but can't fix insufficient HW or broken configurations.
There are about 5,000,000 objects currently in the cluster.
You're probably not hitting this, but read the recent filestore merge and split threads, including the entirety of this thread: https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg29243.html
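For reference, those knobs live in the [osd] section of ceph.conf; the values below are purely illustrative, not a recommendation:

[osd]
filestore merge threshold = 40
filestore split multiple = 8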
Christian
Thanks for the hints.
On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <chibi@xxxxxxx> wrote:
Hello,
On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
It's a 5-node cluster. Each node has 3 OSDs. I set pg_num = 512 for both cephfs_data and cephfs_metadata. I experienced some slow/blocked requests issues when I was using hammer 0.94.x and prior, so I was wondering whether the pg_num is too large for the metadata pool.
Very, VERY much doubt this.
Your "ideal" values for a cluster of this size (are you planning to
grow
it?) would be about 1024 PGs for data and 128 or 256 PGs for
meta-data.
Not really that far off and more importantly not overloading the OSDs
with
too many PGs in total. Or do you have more pools?
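For reference, the usual rule of thumb is (number of OSDs * 100) / replica count, rounded to a power of two: with your 15 OSDs and (I assume) 3 replicas, that's 500, hence 512 or 1024 in total. Note that pg_num can be raised later but never lowered, e.g.:

ceph osd pool set cephfs_data pg_num 1024
ceph osd pool set cephfs_data pgp_num 1024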
I just upgraded the cluster to Jewel today. I will watch whether the problem remains.

Jewel improvements might mask things, but I'd venture that your problems were caused by your HW not being sufficient for the load.
As in, do you use SSD journals, etc? How many IOPS do you need/expect from your CephFS? How many objects are in there?
Christian
Thank you.
On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
I'm not at all sure that rados cppool actually captures everything (it might). Doug has been working on some similar stuff for disaster recovery testing and can probably walk you through moving over.

But just how large *is* your metadata pool in relation to the others? Having a too-large pool doesn't cost much unless it's grossly inflated, and having a nice distribution of your folders is definitely better than not.
-Greg
On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <zhangdibio@xxxxxxxxx> wrote:
Hi,
Is there any way to change the metadata pool for a cephfs without losing any existing data? I know how to clone the metadata pool using rados cppool, but the filesystem still links to the original metadata pool no matter what you name it.

The motivation here is to decrease the pg_num of the metadata pool. I created this cephfs cluster some time ago, and I didn't realize that I shouldn't assign a large pg_num to such a small pool.

I'm not sure if I can delete the fs and re-create it using the existing data pool and the cloned metadata pool.
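What I had in mind was roughly this (untested; run only after stopping all MDS daemons, cephfs_metadata_copy stands in for whatever the cloned pool is called, and the monitors may refuse non-empty pools without extra force flags):

ceph fs rm cephfs --yes-i-really-mean-it
ceph fs new cephfs cephfs_metadata_copy cephfs_data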
Thank you.
Zhang Di
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/