On 11/21/2014 08:14 AM, Florian Haas wrote:
Hi everyone,
I've been trying to get to the bottom of this for a few days, so I
thought I'd take it to the list to see if anyone has insight to share.
Situation: a Ceph 0.87 (Giant) cluster with approx. 250 OSDs. One set of
OSD nodes with just spinners is in one CRUSH ruleset, assigned to a
"spinner" pool; another set of OSD nodes with just SSDs is in a second
ruleset, assigned to an "ssd" pool. Both pools use size 3. In the
default rados bench write (16 threads, 4MB object size), the spinner
pool gets about 500 MB/s throughput and the ssd pool about 850 MB/s.
All relatively normal and what one would expect:
$ sudo rados -p spinner-test bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.544917
Total writes made: 3858
Write size: 4194304
Bandwidth (MB/sec): 505.223
[...]
Average Latency: 0.126193
$ sudo rados -p ssd-test bench 30 write
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
or 0 objects
[...]
Total time run: 30.046918
Total writes made: 6410
Write size: 4194304
Bandwidth (MB/sec): 853.332
[...]
Average Latency: 0.0749883
So we see a bandwidth increase and a latency drop as we go from spinners
to SSDs (note: 75ms average latency still isn't exactly great, but
that's a different discussion to have).
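(Sanity check: at a fixed queue depth, throughput is roughly in-flight
requests x request size / average latency. Here that gives 16 x 4 MB /
0.126193 s ≈ 507 MB/s for the spinner pool and 16 x 4 MB / 0.0749883 s
≈ 853 MB/s for the ssd pool, both consistent with the reported
bandwidth figures.)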
Now I'm trying to duplicate the rados bench results with rbd
bench-write. My assumption would be (and generally this assumption holds
true, in my experience) that when duplicating the rados bench parameters
with rbd bench-write, results should be *roughly* equivalent without RBD
caching, and slightly better with caching.
So here is the spinner pool, no caching:
$ sudo rbd -p spinner-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 114 112.88 473443678.94
2 156 77.91 326786116.90
3 197 65.35 274116599.05
4 240 59.57 249866261.64
elapsed: 4 ops: 256 ops/sec: 55.83 bytes/sec: 234159074.98
Throughput dropped from 500 MB/s (rados bench) to less than half of that
(rbd bench-write).
With caching (all cache related settings at their defaults, unless
overridden with --rbd_* args):
$ sudo rbd -p spinner-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 126 110.44 463201062.29
2 232 108.33 454353540.71
elapsed: 2 ops: 256 ops/sec: 105.97 bytes/sec: 444462860.84
So somewhat closer to what rados bench can do, but not nearly where
you'd expect to be.
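For reference, the cache-related defaults in question (values per the
Giant-era documentation, so worth double-checking against your running
config) amount to roughly this in ceph.conf terms:

[client]
rbd cache = true
rbd cache size = 33554432           # 32 MB
rbd cache max dirty = 25165824      # 24 MB writeback threshold
rbd cache target dirty = 16777216   # 16 MB flush target
rbd cache max dirty age = 1.0       # seconds

Note that 16 threads x 4 MB is 64 MB in flight, i.e. twice the default
cache size, so the cache can't even hold one full round of writes.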
And then for the ssd pool, things get weird. Here's rbd bench-write with
no caching:
$ sudo rbd -p ssd-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 208 193.64 812202592.78
elapsed: 1 ops: 256 ops/sec: 202.14 bytes/sec: 847828574.27
850 MB/s, which is what rados bench reports too. No overhead at all? That
would be nice. Let's write 4GB instead of 1GB:
$ sudo rbd -p ssd-test \
--rbd_cache=false \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20)) \
--io-total $((4<<30))
bench-write io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 208 197.41 827983956.90
2 416 207.91 872038511.36
3 640 211.52 887162647.59
4 864 213.98 897482175.07
elapsed: 4 ops: 1024 ops/sec: 216.39 bytes/sec: 907597866.21
Well, that's kinda nice, except it seems illogical that RBD would be
faster than RADOS, without caching. Let's turn caching on:
$ sudo rbd -p ssd-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20))
bench-write io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 152 141.46 593324418.90
elapsed: 1 ops: 256 ops/sec: 148.64 bytes/sec: 623450766.90
Oddly, we've dropped back to 620 MB/s. Try the 4GB total write for good
measure:
$ sudo rbd -p ssd-test \
--rbd_cache=true \
--rbd_cache_writethrough_until_flush=false \
bench-write rbdtest \
--io-threads 16 \
--io-size $((4<<20)) \
--io-total $((4<<30))
bench-write io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 150 138.46 580729593.09
2 302 145.23 609132960.16
3 454 149.14 625522186.95
4 606 151.10 633767869.87
5 775 152.39 639175835.88
6 927 153.49 643765109.80
elapsed: 6 ops: 1024 ops/sec: 154.11 bytes/sec: 646371592.30
So with the larger total write our average throughput went a little
higher, but it's still well short of the ~900 MB/s we get without
caching.
Now I am aware that there was a performance regression with librbd in
the Giant release (http://tracker.ceph.com/issues/9513) which has been
fixed in 0.88 and backported to the giant branch, but which hasn't yet
shipped in a Giant point release. As I understand it, though, that
issue should only affect reads, not writes.
So three questions:
(1) Are there any known *write* performance regressions in librbd in
giant vs. firefly which would cause rbd bench-write to trail so
significantly behind rados bench (on spinners)?
Hi Florian,
I don't really use rbd bench-write (I tend to use fio with the librbd
engine), but looking at your results above, it appears that the rbd
bench-write tests only ran for 1-4 seconds? I think you need to run
tests for at least a couple of minutes to get a feel for what's going
on. So far I haven't seen the kind of regression you are describing,
but that doesn't mean it doesn't exist! Any chance you could check
whether anything unusual is happening on the client or the OSDs during
the tests (high CPU usage, etc.)?
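If it helps, an fio job roughly equivalent to your rbd bench-write runs
might look like this (a sketch, untested here; pool/image names are
taken from your commands, clientname assumes the client.admin key, and
older fio versions want invalidate=0 with the rbd engine):

$ fio --name=rbd-seq-write \
      --ioengine=rbd --clientname=admin \
      --pool=spinner-test --rbdname=rbdtest \
      --rw=write --bs=4m --iodepth=16 \
      --runtime=120 --time_based \
      --invalidate=0 --group_reporting

The time_based/runtime pair keeps it writing for the full two minutes
regardless of how much data that ends up being.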
(2) Conversely, is there any logical explanation for rbd bench-write
being faster than rados bench (on SSDs), with identical parameters and
no RBD caching?
One thing we do see with rados bench is that a single process can
become a bottleneck at high throughput rates. We typically run
concurrent copies of rados bench to get around this (either from
multiple client nodes, or multiple copies on one node); a sketch is
below. With that, we typically see aggregate rados bench performance
about 20% faster than RBD for sequential IO.
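As a sketch of what I mean by concurrent copies (run names here are
arbitrary; --run-name exists in recent rados versions and keeps the
instances from colliding on object names):

$ for i in 1 2 3 4; do
    sudo rados -p ssd-test bench 120 write --run-name bench-$i &
  done; wait

The aggregate figure is then just the sum of the four per-run
bandwidth numbers.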
(3) Are there any known librbd cache performance issues in giant that
could explain this rather counter-intuitive behavior on all-spinner vs.
all-SSD pools?
I don't think there's anything that directly explains it, but there
are some ongoing investigations into librbd performance in general.
Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com