Re: performance in a small cluster

On Sat, May 25, 2019 at 12:30 AM Mark Lehrer <lehrer@xxxxxxxxx> wrote:
> > but only 20MB/s write and 95MB/s read with 4KB objects.
>
> There is copy-on-write overhead for each block, so 4K performance is
> going to be limited no matter what.

No snapshots are involved, and he's using rados bench, which operates on
block sizes as specified, so no partial updates are involved either.

This workload basically goes straight into the WAL for up to 512 MB, so it's
virtually identical to running the standard fio benchmark for ceph disks.
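
For reference, a minimal sketch of what such a disk-level test might look
like, assuming "the standard fio benchmark" means the commonly cited
single-job, queue-depth-1, synchronous 4K write test. The fio flags are
real; the helper name and the device path /dev/sdX are placeholders
invented for illustration. Writing to a raw device destroys the data on it.

```python
import shlex


def fio_sync_write_cmd(device, block_size="4k", runtime_s=60):
    """Build the argv for a single-job, queue-depth-1 O_DIRECT + O_SYNC write test.

    `device` is a placeholder; pointing fio at a raw device overwrites it.
    """
    return [
        "fio",
        "--name=sync-write-test",
        f"--filename={device}",
        "--direct=1",        # bypass the page cache
        "--sync=1",          # open with O_SYNC so every write must be durable
        "--rw=write",
        f"--bs={block_size}",
        "--numjobs=1",
        "--iodepth=1",
        f"--runtime={runtime_s}",
        "--time_based",
    ]


cmd = fio_sync_write_cmd("/dev/sdX")  # hypothetical device path
print(shlex.join(cmd))  # inspect the command before running it by hand
```

The script only prints the command, so nothing is written until you run it
yourself against a disk you can afford to wipe.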
 

> However, if your system is like mine the main problem you will run
> into is that Ceph was designed for spinning disks.  Therefore, its
> main goal is to make sure that no individual OSD is doing more than
> one or two things at a time no matter what.  Unfortunately, SSDs
> typically don't show best performance until you are doing 20+
> simultaneous I/Os (especially if you use a small block size).

No, there are different defaults for number of threads and other tuning
parameters since Luminous.
 

> You can see this most clearly with iostat (run "iostat -mtxy 1" on one
> of your OSD nodes) and a high queue depth 4K workload.  You'll notice
> that even though the client is trying to do many things at a time, the
> OSD node is practically idle.  Especially problematic is the fact that
> iostat will stay below 1 in the "avgqu-sz" column and the utilization
> % will be very low.  This makes it look like a thread semaphore kind
> of problem to me... and increasing the number of clients doesn't seem
> to make the OSDs work any harder.

IIRC, the RocksDB WAL uses 4 threads/WALs by default; you can change that
in bluestore_rocksdb_options. Yes, that is often a bottleneck, and it is one
of the standard options to tune to get the most IOPS out of NVMe disks.
Well, that and creating more partitions/OSDs on a single disk.


But the main problem is that you want your data written for real, i.e.
durably committed to the device. Many SSDs are just bad at writing small
chunks of data that way.
These benchmark results simply look like a case of a slow disk.
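
As a rough sanity check, the 4K figures from the rados bench output quoted
further down can be run through Little's law (outstanding ops = IOPS x mean
latency) and spread over the disks. This is only a sketch: the replica
count of 3 is an assumption that the thread does not state, and the helper
names are made up for illustration. The 12 OSDs come from the original post
(three nodes with 4 SSD OSDs each).

```python
# Figures copied from the quoted rados bench output for 4K objects.
WRITE_IOPS = 5512
WRITE_LATENCY_S = 0.00290095
READ_IOPS = 35486
READ_LATENCY_S = 0.000443905
OBJECT_SIZE_BYTES = 4096

NUM_OSDS = 12       # 3 nodes x 4 SSD OSDs, from the original post
REPLICA_COUNT = 3   # ASSUMPTION: replicated pool with size=3 (not stated)


def in_flight(iops, latency_s):
    """Little's law: mean number of outstanding operations."""
    return iops * latency_s


def bandwidth_mib_s(iops, size_bytes):
    """Throughput implied by an IOPS figure at a fixed object size."""
    return iops * size_bytes / 2**20


def per_osd_write_iops(client_iops, replicas, osds):
    """Client writes each land on `replicas` OSDs, spread over all OSDs."""
    return client_iops * replicas / osds


print(f"writes in flight: {in_flight(WRITE_IOPS, WRITE_LATENCY_S):.2f}")
print(f"reads in flight:  {in_flight(READ_IOPS, READ_LATENCY_S):.2f}")
print(f"write bandwidth:  {bandwidth_mib_s(WRITE_IOPS, OBJECT_SIZE_BYTES):.2f} MiB/s")
print(f"writes per OSD:   {per_osd_write_iops(WRITE_IOPS, REPLICA_COUNT, NUM_OSDS):.0f}/s")
```

Both in-flight values come out at about 16, which matches the default
concurrency of rados bench (its -t option), so the client queue is kept
full and the limit is per-operation latency. Under the size=3 assumption
each SSD only sees on the order of 1,400 client-driven writes per second.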
 

> I still haven't found a good solution unfortunately but definitely
> keep an eye on the queue size and util% in iostat -- SSD bandwidth &
> iops depend on maximizing the number of parallel I/O operations.  If
> anyone has hints on improving Ceph threading I would love to figure
> this one out.

Agreed, everyone should monitor util%.



--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
 


On Fri, May 24, 2019 at 5:23 AM Robert Sander
<r.sander@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> we have a small cluster at a customer's site with three nodes and 4
> SSD-OSDs each.
> Connected with 10G the system is supposed to perform well.
>
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB
> objects but only 20MB/s write and 95MB/s read with 4KB objects.
>
> This is a little bit disappointing as the 4K performance is also seen in
> KVM VMs using RBD.
>
> Is there anything we can do to improve performance with small objects /
> block sizes?
>
> Jumbo frames have already been enabled.
>
> 4MB objects write:
>
> Total time run:         30.218930
> Total writes made:      3391
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     448.858
> Stddev Bandwidth:       63.5044
> Max bandwidth (MB/sec): 552
> Min bandwidth (MB/sec): 320
> Average IOPS:           112
> Stddev IOPS:            15
> Max IOPS:               138
> Min IOPS:               80
> Average Latency(s):     0.142475
> Stddev Latency(s):      0.0990132
> Max latency(s):         0.814715
> Min latency(s):         0.0308732
>
> 4MB objects rand read:
>
> Total time run:       30.169312
> Total reads made:     7223
> Read size:            4194304
> Object size:          4194304
> Bandwidth (MB/sec):   957.662
> Average IOPS:         239
> Stddev IOPS:          23
> Max IOPS:             272
> Min IOPS:             175
> Average Latency(s):   0.0653696
> Max latency(s):       0.517275
> Min latency(s):       0.00201978
>
> 4K objects write:
>
> Total time run:         30.002628
> Total writes made:      165404
> Write size:             4096
> Object size:            4096
> Bandwidth (MB/sec):     21.5351
> Stddev Bandwidth:       2.0575
> Max bandwidth (MB/sec): 22.4727
> Min bandwidth (MB/sec): 11.0508
> Average IOPS:           5512
> Stddev IOPS:            526
> Max IOPS:               5753
> Min IOPS:               2829
> Average Latency(s):     0.00290095
> Stddev Latency(s):      0.0015036
> Max latency(s):         0.0778454
> Min latency(s):         0.00174262
>
> 4K objects read:
>
> Total time run:       30.000538
> Total reads made:     1064610
> Read size:            4096
> Object size:          4096
> Bandwidth (MB/sec):   138.619
> Average IOPS:         35486
> Stddev IOPS:          3776
> Max IOPS:             42208
> Min IOPS:             26264
> Average Latency(s):   0.000443905
> Max latency(s):       0.0123462
> Min latency(s):       0.000123081
>
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com