On Mon, Nov 28, 2016 at 2:59 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Mon, Nov 28, 2016 at 6:20 PM, Francois Blondel <fblondel@xxxxxxxxxxxx> wrote:
> Hi *,
>
> I am currently testing different scenarios to try to optimize sequential
> read and write speeds using Kernel RBD.
>
> I have two block devices created with:
> rbd create block1 --size 500G --pool rbd --image-feature layering
> rbd create block132m --size 500G --pool rbd --image-feature layering
> --object-size 32M
>
> -> Writing to block1 works quite well (about 200 ops/s, 310 MB/s on average,
> for a 250 GB file) (tests run with dd)
> -> Writing to block132m is much slower (about 40 MB/s on average) and
> generates a high op rate (from 4000 to 13000 ops/s, as seen in ceph -w)
>
> Current test cluster:
>
> health HEALTH_WARN
> noscrub,nodeep-scrub,sortbitwise flag(s) set
> monmap e2: 3 mons at
> {aac=10.113.49.48:6789/0,aad=10.112.33.36:6789/0,aae=10.112.48.60:6789/0}
> election epoch 26, quorum 0,1,2 aad,aae,aac
> osdmap e10962: 38 osds: 38 up, 38 in
> flags noscrub,nodeep-scrub,sortbitwise
> pgmap v120464: 1024 pgs, 1 pools, 486 GB data, 122 kobjects
> 4245 GB used, 50571 GB / 54816 GB avail
> 1024 active+clean
>
> The OSDs (using bluestore) have been created using:
> ceph-disk prepare --zap-disk --bluestore --cluster ceph --cluster-uuid
> XX..XX /dev/sdX
>
> ceph -v : ceph version 10.2.3
> (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
>
>
> Does anyone have experience with a "non-standard" RBD "object-size"?
>
> Could this be due to "bluestore", or has someone already encountered that
> issue using "filestore" OSDs?
It's hard to tell without any additional information: dd command,
iostat or blktrace, probably some OSD logs as well.
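For example, something along these lines would capture that (a rough sketch;
the device name and sizes below are placeholders, not taken from your setup):

# sequential write test that bypasses the page cache
dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct
# per-device utilisation and latency while the test runs (sysstat package)
iostat -x 1
# low-level block trace of the mapped rbd device
blktrace -d /dev/rbd0 -o - | blkparse -i -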
A ton of work has gone into bluestore in kraken, mostly on the
performance front - jewel bluestore has little in common with the
current version.
>
> Should switching to a higher RBD "object-size" at least theoretically improve
> seq r/w speeds?
Well, it really depends on the workload. It may result in an
improvement in certain cases, but there are many downsides - RADOS (be
it with filestore or bluestore) works much better with smaller objects.
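To put rough numbers on that (my own arithmetic, not something measured here):

250 GB at 4M objects:  250 * 1024 / 4  = 64000 objects
250 GB at 32M objects: 250 * 1024 / 32 = 8000 objects

With the default striping (stripe count 1), a sequential stream fills one whole
object before moving to the next, so the larger objects keep each chunk of the
write pinned to a single OSD for 8x longer.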
I agree with Jason in that you are probably better off with the
default. Try experimenting with krbd readahead - bump it to 4M or 8M
or even higher and make sure you have a recent kernel on the client
machine (4.4 or newer).
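Something like this, for example (a sketch, assuming the image is mapped as
/dev/rbd0; read_ahead_kb is in KiB and --setra is in 512-byte sectors):

# bump readahead on the mapped device to 8M
echo 8192 > /sys/block/rbd0/queue/read_ahead_kb
# equivalent using blockdev (16384 * 512 bytes = 8M)
blockdev --setra 16384 /dev/rbd0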
There were a number of threads on this subject on ceph-users. Search
for: single thread sequential kernel rbd readahead, or so.
Thanks,
Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Our experience on a busy production cluster is that 16MB objects give better read and write latency. With 4MB objects we were sometimes seeing 20-second delays; with 16MB it is more like 5 seconds at most. There are a few caveats to our current cluster:
- It is made up of about 200 4TB NL-SAS HDDs with Micron SSDs as journals. I have been told that on 7.2k rpm drives latency jumps after about 200 IOPS per spindle.
- Our workload is 100% VMware VMs running replicated databases, now over NFS, but likely still a lot of small I/O.
I wonder if we are a corner case, but with 16 MB objects we saw a clear improvement in latency and throughput with both the iSCSI gateway and NFS. I will reach out to our performance engineer if anyone is interested in the details of the tests.
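For anyone who wants to try the same comparison, a 16MB-object image is created the same way as in Francois's example, just with a different object size (the image name here is only a placeholder):

rbd create testblock16m --size 500G --pool rbd --image-feature layering --object-size 16M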
Any thoughts on why this is the case? Nick Fisk suggested that maybe the thin space allocation overhead is smaller with larger object sizes.
Regards,
Alex