Re: NVMe's

Hi Brent,

> 1.  If we do a jbod setup, the servers can hold 48 NVMes, if the servers
> were bought with 48 cores and 100+ GB of RAM, would this make sense?

Do you seriously mean 48 NVMes per server? How would you even come remotely close to supporting them in terms of board connectivity (PCIe lanes) and network bandwidth?
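
To put a rough number on it (a back-of-envelope sketch in Python; the per-drive figure is an assumption for a typical PCIe 3.0 x4 enterprise NVMe, not something from your spec):

# Rough sanity check: aggregate drive bandwidth vs. the network ceiling.
# All figures are assumptions, not numbers from the original post.
num_drives = 48
seq_read_per_drive_gbs = 3.0      # GB/s, typical PCIe 3.0 x4 NVMe sequential read
nic_gbs = 2 * 100 / 8             # even 2x100GbE is only ~25 GB/s

aggregate_gbs = num_drives * seq_read_per_drive_gbs
print(f"Drives can source ~{aggregate_gbs:.0f} GB/s")       # ~144 GB/s
print(f"2x100GbE can move ~{nic_gbs:.0f} GB/s")             # ~25 GB/s
print(f"Mismatch: roughly {aggregate_gbs / nic_gbs:.0f}x")  # ~6x

Even before replication traffic, the drives can deliver several times what the network (or the PCIe lanes of a single host) can realistically move.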

Regarding some of your points, there are valuable comments by Mark Nelson in the archives. I hope he is okay with me quoting them here, but it is of course better to look them up in the archives for full context.

RAM with NVMe OSDs:

> So basically the answer is that how much memory you need depends largely
> on how much you care about performance, how many objects are present on
> an OSD, and how many objects (and how much data) you have in your active
> data set.  4GB is sort of our current default memory target per OSD, but
> as someone else mentioned bumping that up to 8-12GB per OSD might make
> sense for OSDs on large NVMe drives.  You can also lower that down to
> about 2GB before you start having real issues, but it definitely can
> have an impact on OSD performance.
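
For the 48-drive box from question 1, those targets multiply out roughly as follows (a sketch only; osd_memory_target is the setting Mark refers to, and the totals ignore OS, page cache and recovery headroom):

# RAM needed just for the OSD daemons at different osd_memory_target values.
num_osds = 48
for per_osd_gb in (2, 4, 8, 12):   # floor, current default, large-NVMe suggestions
    print(f"{per_osd_gb:>2} GB/OSD x {num_osds} OSDs = {per_osd_gb * num_osds} GB")
# -> 96 GB, 192 GB, 384 GB and 576 GB respectively

So even at the 4 GB default, 48 OSDs per host would already want around 192 GB, well beyond the proposed "100+ GB".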


CPUs with NVMe OSDs:

> With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
> going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
> will be network bound for large IO workloads unless you are sticking
> 2x100GbE in.  You might want to consider jumping up to the 7601.  That
> would get you closer to where you want to be for 10 NVMe drives
> (3.2c/6.4t per OSD).
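
Scaled to a 48-drive, 48-core box, that guidance looks like this (a sketch using the per-OSD figures quoted above; the real requirement depends on your IO size mix):

# Cores per OSD: the proposed box vs. the sizing quoted above.
num_osds = 48
physical_cores = 48                # from the original question
print(f"Proposed box: {physical_cores / num_osds:.1f} core per OSD")
for label, cores_per_osd in (("2.4c/OSD (7451 example)", 2.4),
                             ("3.2c/OSD (7601 example)", 3.2)):
    print(f"{label}: ~{round(num_osds * cores_per_osd)} cores for {num_osds} OSDs")
# i.e. 48 cores covers only about a third to 40% of what that sizing
# suggests for small-IO workloads.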

Vitaliy Filippov (I think) has also compiled interesting information in his wiki here: https://yourcmc.ru/wiki/Ceph_performance


Best regards
André


----- On 23 Sep 2020 at 7:39, Brent Kennedy bkennedy@xxxxxxxxxx wrote:

> We currently run an SSD cluster and HDD clusters and are looking at possibly
> creating a cluster for NVMe storage.  For spinners and SSDs, it seemed the
> max recommended per OSD host server was 16 OSDs ( I know it depends on the
> CPUs and RAM, like 1 CPU core and 2GB memory ).
> 
> 
> 
> Questions:
> 1.  If we do a jbod setup, the servers can hold 48 NVMes, if the servers
> were bought with 48 cores and 100+ GB of RAM, would this make sense?
> 
> 2.  Should we just RAID 5 by groups of NVMe drives instead ( and buy less
> CPU/RAM )?  There is a reluctance to waste even a single drive on RAID
> because redundancy is basically Ceph's job.
> 
> 3.  The plan was to build this with Octopus ( hopefully there are no issues
> we should know about ).  Though I just saw one posted today, this is a
> few months off.
> 
> 4.  Any feedback on max OSDs?
> 
> 5.  Right now they run 10Gb everywhere with 80Gb uplinks; I was thinking
> this would need at least 40Gb links to every node ( the hope is to use these
> to speed up image processing at the application layer locally in the DC ).
> I haven't spoken to the Dell engineers yet, but my concern with NVMe is that
> the RAID controller would end up being the bottleneck ( next in line after
> network connectivity ).

-- 
Dipl.-Inf. André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemuend@xxxxxxxxxxxxxxxxxx
Tel: +49 2241 14-2193
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



