Hi everyone,

I've been trying to get to the bottom of this for a few days and thought I'd take it to the list to see if someone has insight to share.

Situation: a Ceph 0.87 (Giant) cluster with approx. 250 OSDs. One set of OSD nodes with only spinners is in one CRUSH ruleset, assigned to a "spinner" pool; another set of OSD nodes with only SSDs is in another ruleset, assigned to an "ssd" pool. Both pools use size 3 (a rough sketch of how the pools map onto their rulesets follows the benchmark output below). With the default rados bench write (16 threads, 4 MB object size), the spinner pool gets about 500 MB/s throughput and the ssd pool about 850 MB/s. All relatively normal and what one would expect:

$ sudo rados -p spinner-test bench 30 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds or 0 objects
[...]
 Total time run:         30.544917
Total writes made:      3858
Write size:             4194304
Bandwidth (MB/sec):     505.223
[...]
Average Latency:        0.126193

$ sudo rados -p ssd-test bench 30 write
 Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds or 0 objects
[...]
 Total time run:         30.046918
Total writes made:      6410
Write size:             4194304
Bandwidth (MB/sec):     853.332
[...]
Average Latency:        0.0749883

So we see a bandwidth increase and a latency drop as we go from spinners to SSDs (note: ~75 ms average latency still isn't exactly great, but that's a different discussion).

Now I'm trying to duplicate the rados bench results with rbd bench-write. My assumption would be (and in my experience this assumption generally holds) that when duplicating the rados bench parameters with rbd bench-write, results should be *roughly* equivalent without RBD caching, and slightly better with caching.

So here is the spinner pool, no caching:

$ sudo rbd -p spinner-test \
      --rbd_cache=false \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       114    112.88  473443678.94
    2       156     77.91  326786116.90
    3       197     65.35  274116599.05
    4       240     59.57  249866261.64
elapsed:     4  ops:      256  ops/sec:    55.83  bytes/sec: 234159074.98

Throughput dropped from 500 MB/s (rados bench) to less than half of that (rbd bench-write).

With caching (all cache-related settings at their defaults, unless overridden with --rbd_* arguments):

$ sudo rbd -p spinner-test \
      --rbd_cache=true \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       126    110.44  463201062.29
    2       232    108.33  454353540.71
elapsed:     2  ops:      256  ops/sec:   105.97  bytes/sec: 444462860.84

So somewhat closer to what rados bench can do, but not nearly where you'd expect to be.

And then for the ssd pool, things get weird. Here's rbd bench-write with no caching:

$ sudo rbd -p ssd-test \
      --rbd_cache=false \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       208    193.64  812202592.78
elapsed:     1  ops:      256  ops/sec:   202.14  bytes/sec: 847828574.27

850 MB/s, which is what rados bench reports too. No overhead at all? That would be nice.
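Before going on, and since the CRUSH/pool layout may matter here: as mentioned above, this is roughly how the two pools are set up. The rule names, root bucket names, rule IDs and PG counts below are illustrative placeholders, not the exact production values:

# one rule rooted in the spinner hosts, one in the SSD hosts (names illustrative)
$ sudo ceph osd crush rule create-simple spinner-rule spinners host
$ sudo ceph osd crush rule create-simple ssd-rule ssds host
# pools pinned to their rulesets, both size 3 (rule IDs and PG counts illustrative)
$ sudo ceph osd pool create spinner-test 2048 2048 replicated
$ sudo ceph osd pool set spinner-test crush_ruleset 1
$ sudo ceph osd pool set spinner-test size 3
$ sudo ceph osd pool create ssd-test 2048 2048 replicated
$ sudo ceph osd pool set ssd-test crush_ruleset 2
$ sudo ceph osd pool set ssd-test size 3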
Back to the SSD pool: let's write 4 GB instead of 1 GB:

$ sudo rbd -p ssd-test \
      --rbd_cache=false \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20)) \
      --io-total $((4<<30))
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       208    197.41  827983956.90
    2       416    207.91  872038511.36
    3       640    211.52  887162647.59
    4       864    213.98  897482175.07
elapsed:     4  ops:     1024  ops/sec:   216.39  bytes/sec: 907597866.21

Well, that's kind of nice, except it seems illogical that RBD would be faster than RADOS without caching. Let's turn caching on:

$ sudo rbd -p ssd-test \
      --rbd_cache=true \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20))
bench-write  io_size 4194304 io_threads 16 bytes 1073741824 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       152    141.46  593324418.90
elapsed:     1  ops:      256  ops/sec:   148.64  bytes/sec: 623450766.90

Oddly, we've dropped back to about 620 MB/s. Let's try the 4 GB total write for good measure:

$ sudo rbd -p ssd-test \
      --rbd_cache=true \
      --rbd_cache_writethrough_until_flush=false \
      bench-write rbdtest \
      --io-threads 16 \
      --io-size $((4<<20)) \
      --io-total $((4<<30))
bench-write  io_size 4194304 io_threads 16 bytes 4294967296 pattern seq
  SEC       OPS   OPS/SEC   BYTES/SEC
    1       150    138.46  580729593.09
    2       302    145.23  609132960.16
    3       454    149.14  625522186.95
    4       606    151.10  633767869.87
    5       775    152.39  639175835.88
    6       927    153.49  643765109.80
elapsed:     6  ops:     1024  ops/sec:   154.11  bytes/sec: 646371592.30

So the average throughput went a little higher over the longer run, but still nowhere near what we get without caching.

Now, I am aware that there was a librbd performance regression in the Giant release (http://tracker.ceph.com/issues/9513), which has been fixed in 0.88 and backported to the giant branch, but which hasn't yet made it into a Giant point release. As I understand it, though, that issue should only affect reads, not writes.

So, three questions:

(1) Are there any known *write* performance regressions in librbd in Giant vs. Firefly that would cause rbd bench-write to trail rados bench so significantly (on spinners)?

(2) Conversely, is there any logical explanation for rbd bench-write being faster than rados bench (on SSDs), with identical parameters and no RBD caching?

(3) Are there any known librbd cache performance issues in Giant that could explain this rather counter-intuitive behavior on all-spinner vs. all-SSD pools?

Cheers,
Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com