On 9/23/20 5:41 AM, George Shuklin wrote:
I've just finished doing our own benchmarking, and I can say that what you want to do is very unbalanced and CPU bound.
1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per ceph-osd at top performance (see the recent thread on 'ceph on brd'), with more realistic numbers around 300-400% CPU per device.
In fact, in isolation on the test setup that Intel donated for community Ceph R&D, we've pushed a single OSD to consume around 1400% CPU at 80K write IOPS! :) I agree though, we typically see a peak of about 500-600% CPU per OSD on multi-node clusters, with correspondingly lower write throughput. I do believe that in some cases the mix of IO we are doing leaves us at least partially bound by disk write latency via the single writer thread in the RocksDB WAL, though.
2. Ceph is unable to deliver more than 12k IOPS per ceph-osd (maybe a little more with a top-tier low-core high-frequency CPU, but not much). So super-duper NVMe won't make a difference. (BTW, I have a stupid idea to try running two ceph-osd daemons on separate LVs carved from the same VG on a single PV, but it's not tested.)
I'm curious if you've tried Octopus+ yet? We refactored BlueStore's caches, which internally has proven to help quite a bit with latency-bound workloads, as it reduces lock contention in the onode cache shards and the impact of cache trimming (no more single trimming thread constantly grabbing the lock for long periods of time!). In a 64 NVMe drive setup (P4510s), we were able to do a little north of 400K write IOPS with 3x replication, so about 19K IOPS per OSD once you factor replication in (400K x 3 / 64). Also, in Nautilus you can see real benefits from running multiple OSDs on a single device, but with Octopus and master we've pretty much closed the gap on our test setup:
https://docs.google.com/spreadsheets/d/1e5eTeHdZnSizoY6AUjH0knb4jTCW7KMU4RoryLX9EHQ/edit?usp=sharing
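If anyone wants to experiment with the split-device idea George mentions, a minimal sketch looks something like the below (device path and VG/LV names are just examples, adjust for your own hardware):

    # carve one NVMe into two LVs, one per OSD
    pvcreate /dev/nvme0n1
    vgcreate ceph-nvme0 /dev/nvme0n1
    lvcreate -l 50%VG -n osd-a ceph-nvme0
    lvcreate -l 100%FREE -n osd-b ceph-nvme0
    # hand each LV to its own OSD
    ceph-volume lvm create --data ceph-nvme0/osd-a
    ceph-volume lvm create --data ceph-nvme0/osd-b

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 should get you to roughly the same place in one step.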
Generally speaking, using the latency-performance or latency-network tuned profiles helps (mostly by avoiding C-state CPU transitions), as do higher clock speeds. Not using replication helps, but that's obviously not a realistic solution for most people. :)
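For reference, switching profiles is just the following (assuming the tuned package is installed):

    tuned-adm profile latency-performance
    tuned-adm active   # verify which profile is applied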
3. You will find that any given client's performance is heavily limited by the sum of all RTTs in the network, plus Ceph's own latencies, so very fast NVMe gives diminishing returns.
4. A CPU-bound ceph-osd completely wipes out any differences between underlying devices (except for desktop-class crawlers).
You can run your own tests, even without fancy 48-NVMe boxes: just run ceph-osd on brd (block RAM disk). ceph-osd won't run any faster on anything else (ramdisk is the fastest), so the numbers you get from brd are a supremum (upper bound) on theoretical performance.
Given a max of 400-500% CPU per ceph-osd, I'd say you need to keep the number of NVMe drives per server below 12, or maybe 15 (but sometimes you'll get CPU saturation).
In my opinion, less fancy boxes with a smaller number of drives per server (but a larger number of servers) would make your (or your operations team's) life much less stressful.
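To put rough numbers on that: at George's 400-500% CPU per OSD, a 48-NVMe box would want something on the order of 200 cores just for the OSDs, so the 48-core configuration mentioned below is nowhere close.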
That's pretty much the advice I've been giving people since the Inktank days. It costs more and is lower density, but the design is simpler, you are less likely to under-provision CPU, less likely to run into memory bandwidth bottlenecks, and you have less recovery to do when a node fails. Especially now with how many NVMe drives you can fit in a single 1U server!
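On George's brd suggestion above, an upper-bound test can be stood up in a couple of commands (the 16GB size is just an example; rd_size is in KiB, and the OSD data lives in RAM, so this is for throwaway testing only):

    # create one 16GB RAM-backed block device at /dev/ram0
    modprobe brd rd_nr=1 rd_size=16777216
    # deploy a throwaway OSD on it
    ceph-volume lvm create --data /dev/ram0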
NEVER ever use raid with ceph.
NEVER is a strong word. There are some specialized products out there that do RAID behind the scenes fairly quickly. In very specific cases you might consider a solution with very fast RAID6-backed OSDs and 2x replication, but generally speaking I agree that simpler is better, especially if you are doing it yourself.
Mark
On 23/09/2020 08:39, Brent Kennedy wrote:
We currently run an SSD cluster and HDD clusters and are looking at possibly creating a cluster for NVMe storage. For spinners and SSDs, it seemed the max recommended per OSD host server was 16 OSDs (I know it depends on the CPUs and RAM, like 1 CPU core and 2GB memory per OSD).
Questions:
1. If we do a JBOD setup, the servers can hold 48 NVMes. If the servers were bought with 48 cores and 100+ GB of RAM, would this make sense?
2. Should we just RAID 5 groups of NVMe drives instead (and buy less CPU/RAM)? There is a reluctance to waste even a single drive on RAID because redundancy is basically Ceph's job.
3. The plan was to build this with Octopus (hopefully there are no issues we should know about). I just saw one posted today, but this is a few months off.
4. Any feedback on max OSDs?
5. Right now they run 10Gb everywhere with 80Gb uplinks. I was thinking this would need at least 40Gb links to every node (the hope is to use these to speed up image processing at the application layer locally in the DC).
I haven't spoken to the Dell engineers yet, but my concern with NVMe is that the RAID controller would end up being the bottleneck (next in line after network connectivity).
Regards,
-Brent
Existing Clusters:
Test: Nautilus 14.2.11 with 3 OSD servers, 1 mon/mgr, 1 gateway, 2 iSCSI gateways (all virtual on NVMe)
US Production (HDD): Nautilus 14.2.11 with 12 OSD servers, 3 mons, 4 gateways, 2 iSCSI gateways
UK Production (HDD): Nautilus 14.2.11 with 12 OSD servers, 3 mons, 4 gateways
US Production (SSD): Nautilus 14.2.11 with 6 OSD servers, 3 mons, 3 gateways, 2 iSCSI gateways
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx