Re: NVMe's

I've just finished doing our own benchmarking, and I can say that what you want to do is very unbalanced and CPU-bound.

1. Ceph consumes a LOT of CPU. My peak value was around 500% CPU per ceph-osd at top performance (see the recent thread on 'ceph on brd'), with more realistic numbers around 300-400% CPU per device.

2. Ceph is unable to deliver more than about 12k IOPS per ceph-osd (maybe a little more with a top-tier low-core, high-frequency CPU, but not much), so a super-duper NVMe won't make a difference. (BTW, I have a stupid idea to try running two ceph-osds on the same PV, each from its own LV in a shared VG, but it's not tested.)

3. You will find that any given client's performance is heavily limited by the sum of all RTTs in the network plus Ceph's own latencies, so a very fast NVMe gives diminishing returns.

4. A CPU-bound ceph-osd completely wipes out any differences between underlying devices (except for desktop-class crawlers).
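To put point 2 in perspective, here is a back-of-the-envelope comparison in Python. The ~12k IOPS figure is from my own benchmarks, and the 500k raw IOPS is an assumed datasheet-class number for a datacenter NVMe, so treat both as rough assumptions rather than guarantees:

    # Back-of-the-envelope: what one ceph-osd extracts from an NVMe versus
    # what the drive can do raw. All numbers are rough assumptions taken
    # from the discussion above, not guarantees.

    IOPS_PER_OSD = 12_000      # observed ceiling per ceph-osd process
    RAW_NVME_IOPS = 500_000    # assumed 4k random-read spec of a datacenter NVMe

    utilization = IOPS_PER_OSD / RAW_NVME_IOPS
    print(f"One ceph-osd uses roughly {utilization:.0%} of one NVMe's raw IOPS")
    # -> about 2%, which is why a faster drive changes almost nothing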

You can run your own tests, even without fancy 48-NVMe boxes: just run ceph-osd on brd (the block RAM disk module). ceph-osd won't run any faster on anything else (a RAM disk is the fastest device you can give it), so the numbers you get from brd are a supremum (upper bound) on theoretical performance.
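If you want to reproduce the brd experiment, a minimal sketch is below (Python only as a thin wrapper around two shell commands; it assumes root, the brd kernel module, and a ceph-volume based deployment, so adapt it for cephadm or whatever tooling you actually use):

    #!/usr/bin/env python3
    # Sketch of the "ceph-osd on brd" baseline test. Assumes root, the brd
    # kernel module, and ceph-volume on the host. Not production code.
    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # One 8 GiB RAM-backed block device at /dev/ram0 (rd_size is in KiB).
    run(["modprobe", "brd", "rd_nr=1", "rd_size=8388608"])

    # Put a single OSD on the RAM disk; whatever this OSD delivers is the
    # upper bound, since no real device will make ceph-osd go faster.
    run(["ceph-volume", "lvm", "create", "--data", "/dev/ram0"])

Then benchmark that OSD with whatever client workload you care about (rados bench, fio with the rbd engine, and so on); if the RAM-disk OSD already tops out around the figures above, faster flash will not move the needle.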

Given a maximum of 400-500% CPU per ceph-osd, I'd say you need to keep the number of NVMe drives per server below 12, or maybe 15 (but then you'll sometimes hit CPU saturation).
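Applied to the proposed 48-NVMe box, the CPU math (again using my rough per-OSD figures, so treat them as assumptions rather than a sizing guarantee) looks like this:

    # CPU budget for one OSD per NVMe in a 48-drive chassis, using the
    # rough 3-5 cores per ceph-osd figures from my benchmarks (assumptions).

    osds = 48
    cores_typical = 3.5   # ~300-400% CPU per OSD in steady state
    cores_peak = 5.0      # ~500% CPU per OSD at peak

    print(f"Cores needed (typical): {osds * cores_typical:.0f}")  # ~168
    print(f"Cores needed (peak):    {osds * cores_peak:.0f}")     # 240
    # A 48-core server covers roughly 9-13 OSDs at these rates, which is
    # where the 12-15 drives per server ceiling comes from.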

In my opinion, less fancy boxes with a smaller number of drives per server (but a larger number of servers) would make your life, or your operations team's life, much less stressful.

NEVER ever use RAID with Ceph.


On 23/09/2020 08:39, Brent Kennedy wrote:
We currently run an SSD cluster and HDD clusters and are looking at possibly
creating a cluster for NVMe storage.  For spinners and SSDs, it seemed the
max recommended per OSD host server was 16 OSDs (I know it depends on the
CPUs and RAM, e.g. 1 CPU core and 2GB of memory per OSD).

Questions:
1.  If we do a JBOD setup, the servers can hold 48 NVMe drives; if the servers
were bought with 48 cores and 100+ GB of RAM, would this make sense?

2.  Should we just RAID 5 groups of NVMe drives instead (and buy less
CPU/RAM)?  There is a reluctance to waste even a single drive on RAID
because redundancy is basically Ceph's job.
3.  The plan was to build this with Octopus (hopefully there are no issues
we should know about).  I did just see one posted today, but this build is
a few months off.

4.  Any feedback on max OSDs?

5.  Right now they run 10Gb everywhere with 80Gb uplinks; I was thinking
this would need at least 40Gb links to every node (the hope is to use these
to speed up image processing at the application layer locally in the DC).
I haven't spoken to the Dell engineers yet, but my concern with NVMe is that
the RAID controller would end up being the bottleneck (next in line after
network connectivity).

Regards,

-Brent

Existing Clusters:

Test: Nautilus 14.2.11 with 3 osd servers, 1 mon/mgr, 1 gateway, 2 iscsi
gateways ( all virtual on nvme )

US Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4
gateways, 2 iscsi gateways

UK Production(HDD): Nautilus 14.2.11 with 12 osd servers, 3 mons, 4 gateways

US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 3 gateways,
2 iscsi gateways

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
