Re: NVMe's

George Shuklin <george.shuklin@xxxxxxxxx> · Wed, 23 Sep 2020 16:23:20 +0300

I've just finished doing our own benchmarking, and I can say that what 
you want to do is very unbalanced and CPU bound.

1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per 
ceph-osd at top performance (see the recent thread on 'ceph on brd'), 
with more realistic numbers around 300-400% CPU per device.

In fact, in isolation on the test setup that Intel donated for 
community ceph R&D, we've pushed a single OSD to consume around 1400% 
CPU at 80K write IOPS! :)  I agree though, we typically see a peak of 
about 500-600% CPU per OSD on multi-node clusters with a 
correspondingly lower write throughput.  I do believe, though, that in 
some cases the mix of IO we are doing leaves us at least partially 
bound by disk write latency, because of the single writer thread in 
the rocksdb WAL.

I'd really like to see how they did this without offloading (their 
configuration).

2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe 
a little more with a top-tier low-core-count, high-frequency CPU, but 
not much). So a super-duper NVMe won't make a difference. (BTW, I have 
a stupid idea to try running two ceph-osd from the same LV with a 
single PV underneath the VG, but it's not tested.)

I'm curious if you've tried octopus+ yet?  We refactored bluestore's 
caches, which internally has proven to help quite a bit with 
latency-bound workloads, as it reduces lock contention in the onode 
cache shards and the impact of cache trimming (no more single trim 
thread constantly grabbing the lock for long periods of time!).  In a 
64 NVMe drive setup (P4510s), we were able to do a little north of 
400K write IOPS with 3x replication, so about 19K IOPS per OSD once 
you factor rep in.  Also, in Nautilus you can see real benefits with 
running multiple OSDs on a single device, but with Octopus and master 
we've pretty much closed the gap on our test setup:

It's octopus. I was doing a single-OSD benchmark, removing all moving 
parts (brd instead of nvme, no network, size=1, etc). Moreover, I've 
focused on the rados benchmark, as RBD performance is just a 
derivative of rados performance.

Anyway, a big thank you for the input.

https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing 
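
The "about 19K IOPS per OSD" figure quoted earlier in the thread can be 
reproduced with simple arithmetic. This is only a sketch: it assumes 
one OSD per NVMe drive (64 OSDs) and that every client write turns 
into one backend write per replica. The function name is made up for 
illustration.

```python
def per_osd_write_iops(client_iops: float, replicas: int, osds: int) -> float:
    """Backend write IOPS each OSD handles, on average.

    Every client write is stored on `replicas` OSDs, so the cluster
    performs client_iops * replicas backend writes spread over `osds`.
    """
    return client_iops * replicas / osds


# 400K client write IOPS, 3x replication, 64 OSDs (assumed: one per drive)
print(per_osd_write_iops(400_000, 3, 64))  # 18750.0
```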

Generally speaking, using the latency-performance or network-latency 
tuned profiles helps (mostly by avoiding CPU C-state transitions), as 
do higher clock speeds.  Not using replication helps, but that's 
obviously not a realistic solution for most people. :)

I used size=1 and 'no ssd, no network' as an upper bound. It allows 
finding the limits of ceph-osd performance. Any real-life things 
(replication, network, real block devices) will make things worse, not 
better. Knowing the upper performance bound is really nice when you 
start choosing a server configuration.

3. You will find that any given client's performance is heavily 
limited by the sum of all RTTs in the network, plus ceph's own 
latencies, so very fast NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between 
underlying devices (except for desktop-class crawlers).
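
To illustrate point 3, here is a rough latency-budget sketch for a 
single client at queue depth 1. All the numbers (RTT, number of round 
trips, ceph overhead, device latencies) are hypothetical placeholders, 
not measurements from this thread; plug in your own.

```python
def qd1_iops(network_rtt_us: float, round_trips: int,
             ceph_us: float, device_us: float) -> float:
    """IOPS one client sees at queue depth 1: the inverse of total latency."""
    total_us = network_rtt_us * round_trips + ceph_us + device_us
    return 1_000_000 / total_us


# Hypothetical: 50 us RTT, 2 round trips (client->primary, primary->replica),
# 400 us spent inside ceph, and an 80 us vs a 10 us device.
slow = qd1_iops(50, 2, 400, 80)   # total latency 580 us
fast = qd1_iops(50, 2, 400, 10)   # total latency 510 us
print(f"{slow:.0f} vs {fast:.0f} IOPS, gain {fast / slow - 1:.0%}")
```

With these made-up inputs, a device that is 8x faster buys only about 
14% more IOPS for that client, because the network and ceph terms 
dominate the sum.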

You can run your own tests, even without fancy 48-nvme boxes - just 
run ceph-osd on brd (block ram disk). ceph-osd won't run any faster 
on anything else (ramdisk is the fastest), so the numbers you get from 
brd are the supremum (upper bound) for theoretical performance.

Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the 
number of NVMe drives per server below 12, or maybe 15 (but then 
you'll sometimes get CPU saturation).
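
The sizing rule above can be sketched as a small calculation. It 
assumes 100% CPU in top equals one logical CPU, one OSD per NVMe 
drive, and nothing else competing for CPU on the box; the 64-CPU 
server is a hypothetical example, and the helper names are made up.

```python
def cores_needed(osds: int, cpu_percent_per_osd: float) -> float:
    """Logical CPUs required if every OSD peaks at the same time."""
    return osds * cpu_percent_per_osd / 100


def max_osds(logical_cpus: int, cpu_percent_per_osd: float) -> int:
    """Largest OSD count that fits without saturating the CPUs."""
    return int(logical_cpus * 100 // cpu_percent_per_osd)


print(cores_needed(12, 500))  # 60.0 logical CPUs for 12 OSDs at 500%
print(max_osds(64, 500))      # 12 OSDs on a hypothetical 64-CPU box
print(max_osds(64, 400))      # 16 OSDs if each peaks at only 400%
```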

In my opinion, less fancy boxes with a smaller number of drives per 
server (but a larger number of servers) would make your (or your 
operations team's) life much less stressful.

That's pretty much the advice I've been giving people since the 
Inktank days.  It costs more and is lower density, but the design is 
simpler, you are less likely to under-provision CPU, less likely to 
run into memory bandwidth bottlenecks, and you have less recovery to 
do when a node fails.  Especially now with how many NVMe drives you 
can fit in a single 1U server!
