Re: performance in a small cluster

Paul Emmerich <paul.emmerich@xxxxxxxx> · Sat, 25 May 2019 01:38:04 +0200

On Sat, May 25, 2019 at 12:30 AM Mark Lehrer <lehrer@xxxxxxxxx> wrote:
> but only 20MB/s write and 95MB/s read with 4KB objects.

> There is copy-on-write overhead for each block, so 4K performance is
> going to be limited no matter what.

No snapshots are involved, and he's using rados bench, which operates on
block sizes as specified, so there are no partial updates either.

This workload basically goes straight into the WAL for up to 512 MB, so it's
virtually identical to running the standard fio benchmark for ceph disks.
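
For a rough feel of what such a sync-write test measures, here is a minimal Python sketch, assuming Linux and a scratch file rather than a raw device. It is no replacement for fio: the function name, block count and file location are made up for illustration, and O_DSYNC writes through a file system only approximate a test against the block device itself.

```python
import os
import tempfile
import time


def sync_write_test(path, block_size=4096, count=200):
    """Write `count` blocks of `block_size` bytes, one synchronous write at a time.

    Only one write is in flight at any moment (queue depth 1).
    All parameters are illustrative, not tuned values.
    """
    # O_DSYNC makes each write return only once the data is on stable storage;
    # fall back to O_SYNC on platforms that do not define O_DSYNC.
    sync_flag = getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | sync_flag, 0o600)
    block = b"\0" * block_size
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return {
        "writes": count,
        "seconds": elapsed,
        "iops": count / elapsed,
        "avg_latency_ms": 1000.0 * elapsed / count,
    }


if __name__ == "__main__":
    # The default temp dir may be tmpfs; pass dir=... to land on the disk under test.
    with tempfile.NamedTemporaryFile() as tmp:
        print(sync_write_test(tmp.name))
```

A disk that manages only a few hundred such writes per second here would be expected to do poorly with small objects as well, whatever the cluster settings are.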

> However, if your system is like mine the main problem you will run
> into is that Ceph was designed for spinning disks.  Therefore, its
> main goal is to make sure that no individual OSD is doing more than
> one or two things at a time no matter what.  Unfortunately, SSDs
> typically don't show best performance until you are doing 20+
> simultaneous I/Os (especially if you use a small block size).

No, since Luminous there are different defaults for the number of threads
and other tuning parameters.

> You can see this most clearly with iostat (run "iostat -mtxy 1" on one
> of your OSD nodes) and a high queue depth 4K workload.  You'll notice
> that even though the client is trying to do many things at a time, the
> OSD node is practically idle.  Especially problematic is the fact that
> iostat will stay below 1 in the "avgqu-sz" column and the utilization
> % will be very low.  This makes it look like a thread semaphore kind
> of problem to me... and increasing the number of clients doesn't seem
> to make the OSDs work any harder.

RocksDB WAL uses 4 threads/WALs by default IIRC; you can change that
in bluestore_rocksdb_options. Yes, that is often a bottleneck, and it is one
of the standard options to tune to get the most IOPS out of NVMe disks.
Well, that and creating more partitions/OSDs on a single disk.
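
Editing that option usually means restating the whole option string with one key changed. A minimal sketch of that edit, assuming the value is a comma-separated list of key=value pairs. The helper name is hypothetical, the sample string is illustrative rather than a real default, and max_write_buffer_number is used purely as an example key: the thread does not say which key corresponds to the "threads/WALs" setting.

```python
def override_rocksdb_option(options, key, value):
    """Return `options` with `key` set to `value`, keeping the other keys in order."""
    pairs = []
    seen = False
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, _old = item.partition("=")
        if name == key:
            pairs.append(f"{name}={value}")
            seen = True
        else:
            pairs.append(item)
    if not seen:
        # Key was not present yet: append it at the end.
        pairs.append(f"{key}={value}")
    return ",".join(pairs)


# Illustrative value only: not a real default and not a tuning recommendation.
current = "compression=kNoCompression,max_write_buffer_number=4,write_buffer_size=268435456"
print(override_rocksdb_option(current, "max_write_buffer_number", 8))
```

Check the running value on an OSD before changing anything, and test any change with the same benchmark before and after.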

But the main problem is that you want to write your data for real. Many
SSDs are just bad at writing small chunks of data.
These benchmark results simply look like a case of a slow disk.
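
The quoted rados bench figures are consistent with simple latency arithmetic (Little's law: throughput is roughly operations in flight divided by average latency). A small sketch, assuming the bench ran with 16 concurrent operations; that is the rados bench default for -t, but the thread does not state the value actually used.

```python
CONCURRENCY = 16  # assumption: rados bench default, not stated in the thread

# (label, op size in bytes, average latency in seconds, reported average IOPS)
# Sizes, latencies and IOPS are copied from the quoted bench output.
runs = [
    ("4MB write", 4194304, 0.142475, 112),
    ("4MB rand read", 4194304, 0.0653696, 239),
    ("4K write", 4096, 0.00290095, 5512),
    ("4K read", 4096, 0.000443905, 35486),
]


def estimate(op_size, avg_latency, concurrency=CONCURRENCY):
    """Little's law: IOPS ~= ops in flight / average latency; MB/s follows from op size."""
    iops = concurrency / avg_latency
    mb_per_s = iops * op_size / (1024 * 1024)
    return iops, mb_per_s


for label, size, latency, reported in runs:
    iops, mb = estimate(size, latency)
    print(f"{label:14s} est. {iops:9.0f} IOPS, {mb:7.1f} MB/s (reported {reported} IOPS)")
```

Under that assumption the estimates land within a few percent of the reported IOPS. At about 2.9 ms per 4K write, 16 writes in flight cannot deliver much more than about 5500 IOPS, so the low 4K bandwidth follows directly from per-write latency.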

> I still haven't found a good solution unfortunately but definitely
> keep an eye on the queue size and util% in iostat -- SSD bandwidth &
> iops depend on maximizing the number of parallel I/O operations.  If
> anyone has hints on improving Ceph threading I would love to figure
> this one out.

Agreed, everyone should monitor util%.
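
One way to keep an eye on those two columns is to parse the extended iostat output. A minimal sketch, assuming the usual "iostat -x" table layout with a header row starting with "Device"; the sample text is made up for illustration and is not output from the cluster in this thread. Newer sysstat releases name the queue column aqu-sz instead of avgqu-sz, so both are accepted.

```python
def parse_iostat(text):
    """Extract queue size and %util per device from `iostat -x` style output."""
    stats = {}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            columns = fields
            continue
        if columns and len(fields) == len(columns):
            row = dict(zip(columns[1:], fields[1:]))
            # Older sysstat calls the column avgqu-sz, newer releases aqu-sz.
            queue = row.get("avgqu-sz", row.get("aqu-sz"))
            if queue is None or "%util" not in row:
                continue
            stats[fields[0]] = {"queue": float(queue), "util": float(row["%util"])}
    return stats


# Made-up sample in the older sysstat column layout.
SAMPLE = """\
Device:  rrqm/s wrqm/s  r/s    w/s   rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb      0.00   0.00    0.00 1400.00  0.00  5.47   8.00     0.35   0.25   0.00    0.25  0.20  28.00
sdc      0.00   0.00    0.00 1350.00  0.00  5.27   8.00     0.31   0.23   0.00    0.23  0.19  25.60
"""

for device, s in parse_iostat(SAMPLE).items():
    note = "queue below 1" if s["queue"] < 1.0 else "queue building up"
    print(f"{device}: queue={s['queue']:.2f} util={s['util']:.1f}% ({note})")
```

Feeding it one interval at a time from "iostat -mtxy 1" would work the same way, since timestamp and CPU lines are skipped.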

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, May 24, 2019 at 5:23 AM Robert Sander <r.sander@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4
> SSD-OSDs each.
> Connected with 10G the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB
> objects but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little bit disappointing as the 4K performance is also seen in
> KVM VMs using RBD.
>
> Is there anything we can do to improve performance with small objects /
> block sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run:         30.218930
> Total writes made:      3391
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     448.858
> Stddev Bandwidth:       63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:           112
> Stddev IOPS:            15
> Max IOPS:               138
> Min IOPS:               80
> Average Latency(s):     0.142475
> Stddev Latency(s):      0.0990132
> Max latency(s):         0.814715
> Min latency(s):         0.0308732
>
> 4MB objects rand read:
>
> Total time run:       30.169312
> Total reads made:     7223
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS:         239
> Stddev IOPS:          23
> Max IOPS:             272
> Min IOPS:             175
> Average Latency(s):   0.0653696
> Max latency(s):       0.517275
> Min latency(s):       0.00201978
>
> 4K objects write:
>
> Total time run:         30.002628
> Total writes made:      165404
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     21.5351
> Stddev Bandwidth:       2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:           5512
> Stddev IOPS:            526
> Max IOPS:               5753
> Min IOPS:               2829
> Average Latency(s):     0.00290095
> Stddev Latency(s):      0.0015036
> Max latency(s):         0.0778454
> Min latency(s):         0.00174262
>
> 4K objects read:
>
> Total time run:       30.000538
> Total reads made:     1064610
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   138.619
> Average IOPS:         35486
> Stddev IOPS:          3776
> Max IOPS:             42208
> Min IOPS:             26264
> Average Latency(s):   0.000443905
> Max latency(s):       0.0123462
> Min latency(s):       0.000123081
>
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
