Thanks for the feedback, everyone! It seems we have more to look into regarding NVMe enterprise storage solutions. The workload doesn't demand NVMe performance, so SSD seems to be the most cost-effective way to handle this. The performance discussion is very interesting!

Regards,
Brent

-----Original Message-----
From: Stefan Kooman <stefan@xxxxxx>
Sent: Wednesday, September 23, 2020 3:49 AM
To: Brent Kennedy <bkennedy@xxxxxxxxxx>; 'ceph-users' <ceph-users@xxxxxxx>
Subject: Re: NVMe's

On 2020-09-23 07:39, Brent Kennedy wrote:
> We currently run an SSD cluster and HDD clusters and are looking at
> possibly creating a cluster for NVMe storage. For spinners and SSDs,
> it seemed the max recommended per OSD host server was 16 OSDs (I know
> it depends on the CPUs and RAM, e.g. 1 CPU core and 2 GB of memory per OSD).
>
> Questions:
> 1. If we do a JBOD setup, the servers can hold 48 NVMes; if the
> servers were bought with 48 cores and 100+ GB of RAM, would this make sense?

As always ... it depends :-). But I would not recommend it. For NVMe you want more like 10 GB per OSD (osd_memory_target), plus some spare RAM for the buffer cache; a back-of-the-envelope sizing is sketched at the end of this message. The amount of CPU would be sufficient for normal use, but might not be enough in a recovery situation, during RocksDB housekeeping, etc. It also depends on which Ceph features you want to use: RBD won't use much OMAP/META, so you would be OK with that use case.

> 2. Should we just RAID 5 groups of NVMe drives instead (and buy less
> CPU/RAM)? There is a reluctance to waste even a single drive on RAID
> because redundancy is basically Ceph's job.

Yeah, let Ceph handle the redundancy. You don't want to use hardware RAID controllers.

> 3. The plan was to build this with Octopus (hopefully there are no
> issues we should know about). Though I just saw one posted today,
> but this is a few months off.

Should be OK, especially for new clusters. Test, test, test.

> 4. Any feedback on max OSDs?

I would recommend something like 10 NVMe drives per server. From a performance perspective, more nodes are always better than denser nodes: the more nodes you have, the smaller the impact when one of them fails, and the faster recovery / backfill completes. A quick comparison at the end of this message illustrates this.

> 5. Right now they run 10Gb everywhere with 80Gb uplinks. I was
> thinking this would need at least 40Gb links to every node (the hope
> is to use these to speed up image processing at the application layer
> locally in the DC).

Do you want to be able to fully utilize the throughput of all those NVMe drives? That will be an issue. You will be limited by bandwidth when backfilling those OSDs, especially if you need to backfill a whole node at once; see the rough numbers at the end of this message.

> I haven't spoken to the Dell engineers yet, but my concern with NVMe
> is that the RAID controller would end up being the bottleneck (next
> in line after network connectivity).

Most probably, yes, plus increased latency. My standpoint is not to use hardware RAID controllers for NVMe storage.

Gr. Stefan
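
To put rough numbers on the memory sizing (a sketch, assuming the 10 GB osd_memory_target suggested above and the 48-OSD node from the question; the overhead figure is an estimate):

    48 OSDs x 10 GB osd_memory_target   = 480 GB
    OS, daemons, spare buffer cache     =  64 GB (estimate)
    -------------------------------------------
    comfortable total                   ~ 544 GB RAM

    100 GB across 48 OSDs is only ~2 GB per OSD, the old HDD rule of thumb.

On Octopus the target can be set cluster-wide from the central config, for example:

    ceph config set osd osd_memory_target 10737418240   # 10 GiB in bytes

Note that osd_memory_target is a best-effort target, not a hard limit; OSDs can exceed it temporarily, e.g. during recovery.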
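
On node density, a quick comparison of the failure impact (illustrative numbers; the 4 TB per NVMe is an assumption, not from the thread):

    48-drive node fails: 48 x 4 TB = 192 TB to re-replicate
    10-drive node fails: 10 x 4 TB =  40 TB to re-replicate

The same raw capacity spread over more, smaller nodes means each failure domain holds less data, so a single node failure triggers far less backfill and the surviving nodes each absorb a smaller share of it.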
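
And the rough numbers behind the bandwidth concern (again a sketch; the per-drive throughput is an assumption, check the datasheets of your drives):

    40 Gbit/s link            ~ 5 GB/s, as a theoretical ceiling
    one enterprise NVMe drive ~ 2-3 GB/s sequential read

Two or three drives can already saturate a 40Gb link. Backfilling a failed 10 x 4 TB node over that link takes at least 40,000 GB / 5 GB/s = 8,000 s, about 2.2 hours, as a hard floor; real backfill is throttled and competes with client I/O, so it will take considerably longer.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx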