Re: High ops/s with kRBD and "--object-size 32M"

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Alex Gorbachev
Sent: 29 November 2016 04:24
To: Francois Blondel <fblondel@xxxxxxxxxxxx>; Ilya Dryomov <idryomov@xxxxxxxxx>
Cc: ceph-users@xxxxxxxx
Subject: Re: High ops/s with kRBD and "--object-size 32M"

On Mon, Nov 28, 2016 at 2:59 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:

On Mon, Nov 28, 2016 at 6:20 PM, Francois Blondel <fblondel@xxxxxxxxxxxx> wrote:
> Hi *,
>
> I am currently testing different scenarios to try to optimize sequential
> read and write speeds using Kernel RBD.
>
> I have two block devices created with :
>   rbd create block1 --size 500G --pool rbd --image-feature layering
>   rbd create block132m --size 500G --pool rbd --image-feature layering
> --object-size 32M
>
> -> Writing to block1 works quite well (about 200 ops/s, 310MB/s on average,
> for a 250GB file) (tests running with dd)
> -> Writing to block132m is much slower (about 40MB/s on average) and
> generates far more ops/s (from 4000 to 13000, as seen in ceph -w)
>
> Current test cluster:
>
>      health HEALTH_WARN
>             noscrub,nodeep-scrub,sortbitwise flag(s) set
>      monmap e2: 3 mons at
> {aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0}
>             election epoch 26, quorum 0,1,2 aad,aae,aac
>      osdmap e10962: 38 osds: 38 up, 38 in
>             flags noscrub,nodeep-scrub,sortbitwise
>       pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
>             4245 GB used, 50571 GB / 54816 GB avail
>                 1024 active+clean
>
> The OSDs (using bluestore) have been created using:
>     ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid
> XX..XX  /dev/sdX
>
>     ceph -v :   ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>
>
> Does anyone have experience with a "non-standard" RBD "object-size"?
>
> Could this be due to "bluestore", or has someone already encountered this
> issue with "filestore" OSDs?

It's hard to tell without any additional information: dd command,
iostat or blktrace, probably some OSD logs as well.
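
For example, the client-side IO pattern could be captured with something
along these lines (assuming the image is mapped as /dev/rbd0 on the client):

    # capture 30 seconds of block-layer events for the rbd device
    blktrace -d /dev/rbd0 -o rbdtrace -w 30

    # summarise the trace (request sizes, merges, completions)
    blkparse -i rbdtrace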

A ton of work has gone into bluestore in kraken, mostly on the
performance front - jewel bluestore has little in common with the
current version.

>
> Should switching to a higher RBD "object-size" at least theoretically improve
> seq r/w speeds?

Well, it really depends on the workload.  It may result in an
improvement in certain cases, but there are many downsides - RADOS (be
it with filestore or bluestore) works much better with smaller objects.

I agree with Jason in that you are probably better off with the
default.  Try experimenting with krbd readahead - bump it to 4M or 8M
or even higher and make sure you have a recent kernel on the client
machine (4.4 or newer).
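
For example, readahead for a mapped image can be raised with something like
this on the client (assuming the image is mapped as /dev/rbd0):

    # read_ahead_kb is in kilobytes: 4096 = 4M, 8192 = 8M
    echo 8192 > /sys/block/rbd0/queue/read_ahead_kb

    # or via blockdev, which takes 512-byte sectors: 16384 * 512B = 8M
    blockdev --setra 16384 /dev/rbd0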

There were a number of threads on this subject on ceph-users.  Search
for: single thread sequential kernel rbd readahead, or so.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Our experience on a busy production cluster shows better read and write latency with 16MB objects. With 4MB objects we were sometimes seeing 20-second delays; with 16MB it is more like 5 seconds at most. There are a few caveats about our current cluster:

- it is made up of about 200 4TB NL-SAS HDDs with Micron SSDs as journals. I have been told that on 7.2k rpm drives latency jumps after about 200 IOPS per spindle.

- our workload is 100% VMware VMs running replicated databases, now served over NFS, but likely still a lot of small IO.

We may be a corner case, but with 16 MB objects we saw a clear improvement in latency and throughput with both the iSCSI gateway and NFS. I will reach out to our performance engineer if anyone is interested in the details of the tests.

Any thoughts on why this is the case? Nick Fisk thought that perhaps the thin space-allocation overhead is smaller with larger object sizes.

I think there are definite issues when using VMFS+iSCSI, where VMFS metadata updates cause PG contention, but I am puzzled as to why larger objects would help there, and equally puzzled why they would help with NFS. I'm interested in why you are seeing this and would love to dig deeper. There are also issues around VMware storage migration, where VMware writes 32 threads of 64kB IO all to the same object, causing massive PG contention; RBD striping might be an interesting solution for that.
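
As a rough illustration only (the image name and striping parameters below are
placeholders, not a recommendation), striping is set at image creation time,
e.g.:

    # 64K stripe unit (65536 bytes) spread across 16 objects
    rbd create blockstriped --size 500G --pool rbd --image-feature layering \
        --object-size 4M --stripe-unit 65536 --stripe-count 16

As far as I know the kernel client does not support non-default striping, so
this would only apply on the librbd side (iSCSI/NFS gateways).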

 

The only theory I have relates to some testing I have been doing lately around favouring keeping inodes and dentries in cache by setting vfs_cache_pressure=1. Larger objects would certainly help there: with 16MB rather than 4MB objects you have 4x fewer inodes to cache.
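
For reference, that is tuned on the OSD hosts with something like the
following (1 strongly favours keeping inode/dentry caches; the default is 100):

    # apply immediately
    sysctl -w vm.vfs_cache_pressure=1

    # persist across reboots (the file name here is just an example)
    echo "vm.vfs_cache_pressure = 1" >> /etc/sysctl.d/90-vfs-cache.conf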

 

To the OP: one thing that stands out is that in your first test the average IO size is about 1.5MB, while in the second test it is around 4kB. That alone would explain the difference in performance. Could you do the following (example commands after the list):

 

1. Run iostat on the client and confirm the average IO size during the test.

2. Look in the /sys/block/rbd0/queue directory and check that the IO size settings haven't limited themselves to something stupid.
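
For example (assuming the image is mapped as /dev/rbd0):

    # average request size shows up in the avgrq-sz / areq-sz column
    iostat -xm 5 /dev/rbd0

    # current block-layer limits for the rbd device
    grep . /sys/block/rbd0/queue/max_sectors_kb \
           /sys/block/rbd0/queue/max_hw_sectors_kb \
           /sys/block/rbd0/queue/read_ahead_kb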

 


Regards,
Alex


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
