On 11/21/2014 08:14 AM, Florian Haas wrote:
Hi everyone,
I've been trying to get to the bottom of this for a few days, so I
thought I'd take it to the list to see if anyone has insight to share.
Situation: a Ceph 0.87 (Giant) cluster with approx. 250 OSDs. One set of
OSD nodes with just spinners is in one CRUSH ruleset, assigned to a
"spinner" pool; another set of OSD nodes with just SSDs is in a second
ruleset, assigned to an "ssd" pool. Both pools use size 3. In the
default rados bench write (16 threads, 4MB object size), the spinner
pool gets about 500 MB/s throughput and the ssd pool about 850 MB/s.
All relatively normal and what one would expect:
$ sudo rados -p spinner-test bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.544917
Total writes made: 3858
Write size: 4194304
Bandwidth (MB/sec): 505.223
[...]
Average Latency: 0.126193
$ sudo rados -p ssd-test bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.046918
Total writes made: 6410
Write size: 4194304
Bandwidth (MB/sec): 853.332
[...]
Average Latency: 0.0749883
So we see a bandwidth increase and a latency drop as we go from spinners
to SSDs (note: 75ms average latency still isn't exactly great, but
that's a different discussion to have).
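(Sanity check: at a fixed queue depth, throughput is roughly in-flight
requests x request size / average latency. Here that gives 16 x 4 MB /
0.126193 s ≈ 507 MB/s for the spinner pool and 16 x 4 MB / 0.0749883 s
≈ 853 MB/s for the ssd pool, both consistent with the reported
bandwidth figures.)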
Now I'm trying to duplicate the rados bench results with rbd
bench-write. My assumption would be (and generally this assumption holds
true, in my experience) that when duplicating the rados bench parameters
with rbd bench-write, results should be *roughly* equivalent without RBD
caching, and slightly better with caching.
So here is the spinner pool, no caching:
$ sudo rbd -p spinner-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 114 112.88 473443678.94
2 156 77.91 326786116.90
3 197 65.35 274116599.05
4 240 59.57 249866261.64
elapsed: 4 ops: 256 ops/sec: 55.83 bytes/sec: 234159074.98
Throughput dropped from 500 MB/s (rados bench) to less than half of that
(rbd bench-write).
With caching (all cache related settings at their defaults, unless
overridden with --rbd_* args):
$ sudo rbd -p spinner-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 126 110.44 463201062.29
2 232 108.33 454353540.71
elapsed: 2 ops: 256 ops/sec: 105.97 bytes/sec: 444462860.84
So somewhat closer to what rados bench can do, but not nearly where
you'd expect to be.
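For reference, the cache-related defaults in question (values per the
Giant-era documentation, so worth double-checking against your running
config) amount to roughly this in ceph.conf terms:

[client]
rbd cache = true
rbd cache size = 33554432           # 32 MB
rbd cache max dirty = 25165824      # 24 MB writeback threshold
rbd cache target dirty = 16777216   # 16 MB flush target
rbd cache max dirty age = 1.0       # seconds

Note that 16 threads x 4 MB is 64 MB in flight, i.e. twice the default
cache size, so the cache can't even hold one full round of writes.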
And then for the ssd pool, things get weird. Here's rbd bench-write with
no caching:
$ sudo rbd -p ssd-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 208 193.64 812202592.78
elapsed: 1 ops: 256 ops/sec: 202.14 bytes/sec: 847828574.27
850 MB/s, which is what rados bench reports too. No overhead at all? That
would be nice. Let's write 4GB instead of 1GB:
$ sudo rbd -p ssd-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20)) \
--io-total $((4<<30))
bench-write io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 208 197.41 827983956.90
2 416 207.91 872038511.36
3 640 211.52 887162647.59
4 864 213.98 897482175.07
elapsed: 4 ops: 1024 ops/sec: 216.39 bytes/sec: 907597866.21
Well, that's kinda nice, except it seems illogical that RBD would be
faster than RADOS, without caching. Let's turn caching on:
$ sudo rbd -p ssd-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 152 141.46 593324418.90
elapsed: 1 ops: 256 ops/sec: 148.64 bytes/sec: 623450766.90
Oddly, we've dropped back to 620 MB/s. Try the 4GB total write for good
measure:
$ sudo rbd -p ssd-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20)) \
--io-total $((4<<30))
bench-write io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 150 138.46 580729593.09
2 302 145.23 609132960.16
3 454 149.14 625522186.95
4 606 151.10 633767869.87
5 775 152.39 639175835.88
6 927 153.49 643765109.80
elapsed: 6 ops: 1024 ops/sec: 154.11 bytes/sec: 646371592.30
So with the larger total write our average throughput went a little
higher, but it's still well short of the ~900 MB/s we get without
caching.
Now I am aware that there was a performance regression with librbd in
the Giant release (http://tracker.ceph.com/issues/9513) which has been
fixed in 0.88 and backported to the giant branch, but which hasn't yet
shipped in a Giant point release. As I understand it, though, that
issue should only affect reads, not writes.
So three questions:
(1) Are there any known *write* performance regressions in librbd in
giant vs. firefly which would cause rbd bench-write to trail so
significantly behind rados bench (on spinners)?
Hi Florian,
I don't really use rbd bench-write (I tend to use fio with the librbd
engine), but looking at your results above, it appears that the rbd
bench-write tests only ran for 1-4 seconds? I think you need to run
tests for at least a couple of minutes to get a feel for what's going
on. So far I haven't seen the kind of regression you are describing,
but that doesn't mean it doesn't exist! Any chance you could check
whether anything unusual is happening on the client or the OSDs during
the tests (high CPU usage, etc.)?
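If it helps, an fio job roughly equivalent to your rbd bench-write runs
might look like this (a sketch, untested here; pool/image names are
taken from your commands, clientname assumes the client.admin key, and
older fio versions want invalidate=0 with the rbd engine):

$ fio --name=rbd-seq-write \
      --ioengine=rbd --clientname=admin \
      --pool=spinner-test --rbdname=rbdtest \
      --rw=write --bs=4m --iodepth=16 \
      --runtime=120 --time_based \
      --invalidate=0 --group_reporting

The time_based/runtime pair keeps it writing for the full two minutes
regardless of how much data that ends up being.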
(2) Conversely, is there any logical explanation for rbd bench-write
being faster than rados bench (on SSDs), with identical parameters and
no RBD caching?
One thing we do see with rados bench is that a single process can
become a bottleneck at high throughput rates. We typically run
concurrent copies of rados bench to get around this (either from
multiple client nodes, or multiple copies on one node); a sketch is
below. With that, we typically see aggregate rados bench performance
about 20% faster than RBD for sequential IO.
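As a sketch of what I mean by concurrent copies (run names here are
arbitrary; --run-name exists in recent rados versions and keeps the
instances from colliding on object names):

$ for i in 1 2 3 4; do
    sudo rados -p ssd-test bench 120 write --run-name bench-$i &
  done; wait

The aggregate figure is then just the sum of the four per-run
bandwidth numbers.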
(3) Are there any known librbd cache performance issues in giant that
could explain this rather counter-intuitive behavior on all-spinner vs.
all-SSD pools?
I don't think there's anything that directly explains it, but there
are some ongoing investigations into librbd performance in general.
Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com