Re: NVMe's

On 23/09/2020 17:58, vitalif@xxxxxxxxxx wrote:
I have no idea how you get 66k write iops with one OSD )

I've just repeated a test by creating a test pool on one NVMe OSD with 8 PGs (all pinned to the same OSD with pg-upmap). Then I ran 4x fio randwrite q128 over 4 RBD images. I got 17k iops.

8 PGs is a low number; there will be a lot of PG lock contention across your 4x 128 queue depth.
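For reference, a single-OSD test along those lines can be sketched roughly as follows (pool and image names here are made-up placeholders, and the pg-upmap pinning step from the original test is only noted as a comment):

  # create a small test pool and one RBD image
  ceph osd pool create testpool 8 8
  rbd pool init testpool
  rbd create testpool/img1 --size 128G
  # (in the test above, all 8 PGs were then pinned to a single OSD via pg-upmap)

  # one fio client per image; run four of these (img1..img4) for the 4x q128 case
  fio --ioengine=rbd --clientname=admin --pool=testpool --rbdname=img1 \
      --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 --direct=1 \
      --time_based --runtime=300 --name=rbd-randwrite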


OK, in fact that's not the worst result for Ceph, but the problem is that I only get 30k write IOPS when benchmarking 4 RBD images spread over all OSDs _in_the_same_cluster_. And there are 14 of them.

I've just finished doing our own benchmarking, and I can say that what
you want to do is very unbalanced and CPU bound.

1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per
ceph-osd at top performance (see the recent thread on 'ceph on brd'),
with more realistic numbers around 300-400% CPU per device.
In fact in isolation on the test setup that Intel donated for
community ceph R&D we've pushed a single OSD to consume around 1400%
CPU at 80K write IOPS! :)  I agree though, we typically see a peak of
about 500-600% CPU per OSD on multi-node clusters with a
correspondingly lower write throughput.  I do believe that in some
cases the mix of IO we are doing is causing us to at least be
partially bound by disk write latency with the single writer thread
in the rocksdb WAL though.
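(If you want to see where your own OSDs land, per-OSD CPU figures like the ones above are just what the usual process tools report for ceph-osd, where anything over 100% means more than one core, e.g.:

  # snapshot of CPU usage for all ceph-osd processes
  top -b -n 1 -p "$(pgrep -d, ceph-osd)"
  # or sample a single OSD once per second
  pidstat -u -p "$(pgrep -o ceph-osd)" 1

nothing Ceph-specific is needed.)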
I'd really like to see how they did this without offloading (their
configuration).
I went back and looked over some of the old results. I didn't find the
really high test scores (and now that I'm thinking about it they may
have been from when I was ripping out pglog OMAP updates!), but here's
one example I did find from earlier testing last winter that at least
got into roughly the right ballpark with stock master from last December
(~66K IOPS):

Avg 4K FIO randwrite IOPS: 65841.7

- 1 p4510 NVMe backed OSD

- 8GB osd memory target

- 4K min alloc size

- 4 clients, 1 128GB RBD volume per client, io_depth=128, time=300s

- 128 PGs (fixed)

- latency-network tuned profile

- bluestore_rocksdb_options = "compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal"

- bluestore_default_buffered_write = true

- bluestore_default_buffered_read = true

- rbd cache = false
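Put together, a rough sketch of applying the same settings with the ceph CLI would look like the following (values copied from the list above; whether you want them, especially universal compaction, is another question, see the caveat below):

  # tuned profile used for the test
  tuned-adm profile latency-network

  # OSD-side settings (min alloc size only takes effect when an OSD is created)
  ceph config set osd osd_memory_target 8589934592
  ceph config set osd bluestore_min_alloc_size_ssd 4096
  ceph config set osd bluestore_default_buffered_write true
  ceph config set osd bluestore_default_buffered_read true
  ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_total_wal_size=1073741824,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=4,write_buffer_size=67108864,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,compaction_style=kCompactionStyleUniversal"

  # client-side rbd cache off
  ceph config set client rbd_cache false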

Beyond that, general stuff like background scrubbing and PG autoscaling
was disabled.  I should note that these results are using universal
compaction in rocksdb which you probably don't want to do in production
because it can require 2x the total DB space to perform a compaction.
It might actually be feasible now that we are doing column family
sharding thanks to Adam's PR because you will only need 2x the space of
any individual column family for compaction rather than the whole DB,
but it's still unsupported for now.

Mark

2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe
a little more with a top-tier low-core high-frequency CPU, but not
much). So a super-duper NVMe won't make a difference. (BTW, I have a
stupid idea to try running two ceph-osd daemons on LVs from the same
VG with a single PV underneath, but it's not tested.)
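(For what it's worth, that untested idea would look roughly like this with LVM and ceph-volume; the device name is just an example:

  # one PV/VG on the NVMe, two LVs, one OSD per LV
  pvcreate /dev/nvme0n1
  vgcreate ceph-nvme0 /dev/nvme0n1
  lvcreate -l 50%VG -n osd-a ceph-nvme0
  lvcreate -l 100%FREE -n osd-b ceph-nvme0
  ceph-volume lvm create --data ceph-nvme0/osd-a
  ceph-volume lvm create --data ceph-nvme0/osd-b

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 does roughly the same thing in one step.)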
I'm curious if you've tried octopus+ yet?  We refactored bluestore's
caches, which internally has proven to help quite a bit with
latency-bound workloads as it reduces lock contention in onode cache
shards and the impact of cache trimming (no more single trim thread
constantly grabbing the lock for long periods of time!).  In a 64
NVMe drive setup (P4510s), we were able to do a little north of 400K
write IOPS with 3x replication, so about 19K IOPS per OSD once you
factor rep in (400K x 3 replicas / 64 OSDs).  Also, in Nautilus you
can see real benefits with running multiple OSDs on a single device,
but with Octopus and master we've pretty much closed the gap on our
test setup:
It's Octopus. I was doing a single-OSD benchmark, removing all moving
parts (brd instead of NVMe, no network, size=1, etc.). Moreover, I've
focused on the rados benchmark, as RBD performance is just a derivative
of rados performance.

Anyway, a big thank you for the input.

https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing

Generally speaking, using the latency-performance or latency-network
tuned profiles helps (mostly by avoiding C-state CPU transitions), as
do higher clock speeds.  Not using replication helps, but that's
obviously not a realistic solution for most people. :)
I used size=1 and 'no SSD, no network' as an upper bound. It allows
finding the limits of ceph-osd performance. Any real-life things
(replication, network, real block devices) will make things worse, not
better. Knowing the upper performance bound is really nice when you
start choosing a server configuration.

3. You will find that any given client's performance is heavily limited
by the sum of all RTTs in the network, plus Ceph's own latencies, so
very fast NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between
underlying devices (except for desktop-class crawlers).

You can run your own tests, even without fancy 48-NVMe boxes - just
run ceph-osd on brd (the block RAM disk module). ceph-osd won't run
any faster on anything else (a ramdisk is the fastest), so the numbers
you get from brd are a supremum (upper bound) on theoretical performance.
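A minimal brd-based throwaway OSD, assuming a node with enough free RAM, might look like this (sizes and the device path are examples; everything on the ramdisk disappears on reboot):

  # create one 32 GiB RAM block device at /dev/ram0 (rd_size is in KiB)
  modprobe brd rd_nr=1 rd_size=33554432
  # deploy a disposable OSD on it, then benchmark with rados bench / fio as usual
  ceph-volume lvm create --data /dev/ram0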

Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the
number of NVMe drives per server below 12, or maybe 15 (but sometimes
you'll get CPU saturation).

In my opinion, less fancy boxes with a smaller number of drives per
server (but a larger number of servers) would make your (or your
operations team's) life much less stressful.
That's pretty much the advice I've been giving people since the
Inktank days.  It costs more and is lower density, but the design is
simpler, you are less likely to under-provision CPU, less likely to
run into memory bandwidth bottlenecks, and you have less recovery to
do when a node fails.  Especially now with how many NVMe drives you
can fit in a single 1U server!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx