On 2020-09-23 07:39, Brent Kennedy wrote:
> We currently run an SSD cluster and HDD clusters and are looking at
> possibly creating a cluster for NVMe storage. For spinners and SSDs, it
> seemed the max recommended per OSD host server was 16 OSDs (I know it
> depends on the CPUs and RAM, like 1 CPU core and 2 GB memory).
>
> Questions:
>
> 1. If we do a JBOD setup, the servers can hold 48 NVMes. If the servers
> were bought with 48 cores and 100+ GB of RAM, would this make sense?

As always ... it depends :-). But I would not recommend it. For NVMe you
want to use more like 10 GB per OSD (osd_memory_target) and have some
spare RAM for buffer cache. The amount of CPU would be sufficient for
normal use, but might not be enough in a recovery situation, during
RocksDB housekeeping, etc. It also depends on which Ceph features you
want to use (RBD won't use much OMAP/META, so you would be OK with that
use case).

> 2. Should we just RAID 5 groups of NVMe drives instead (and buy less
> CPU/RAM)? There is a reluctance to waste even a single drive on RAID
> because redundancy is basically Ceph's job.

Yeah, let Ceph handle the redundancy. You don't want to use hardware RAID
controllers.

> 3. The plan was to build this with Octopus (hopefully there are no
> issues we should know about). Though I just saw one posted today, but
> this is a few months off.

Should be OK, especially for new clusters. Test, test, test.

> 4. Any feedback on max OSDs?

I would recommend something like 10 NVMe per server. More nodes are
always better than denser nodes from a performance perspective: the more
nodes you have, the smaller the impact when one node fails, and the
faster recovery / backfill completes.

> 5. Right now they run 10Gb everywhere with 80Gb uplinks. I was thinking
> this would need at least 40Gb links to every node (the hope is to use
> these to speed up image processing at the application layer locally in
> the DC).

Do you want to be able to fully utilize the throughput of all those NVMe
drives? That will be an issue. You will also be limited by network
bandwidth when backfilling those OSDs (especially if you need to
backfill a whole node at once).

> I haven't spoken to the Dell engineers yet, but my concern with NVMe is
> that the RAID controller would end up being the bottleneck (next in
> line after network connectivity).

Most probably, yes, plus increased latency. My standpoint is not to use
hardware RAID controllers for NVMe storage.

Gr. Stefan
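
For reference, a rough back-of-envelope sketch (Python) of the RAM and
backfill math behind the answers above. The 48-drive count, the 10 GB
osd_memory_target and the 40 Gbit/s link come from this thread; the 4 TB
drive size, 50% fill level and 32 GiB of cache headroom are illustrative
assumptions only, not measured values.

GiB = 2**30
TiB = 2**40

osds_per_node = 48              # drive count from question 1 above
osd_memory_target = 10 * GiB    # per-OSD target suggested above; can be set
                                # via: ceph config set osd osd_memory_target <bytes>
cache_headroom = 32 * GiB       # assumed spare RAM for page cache / OS

ram_needed = osds_per_node * osd_memory_target + cache_headroom
print(f"RAM per node: ~{ram_needed / GiB:.0f} GiB (vs. the ~100 GB planned)")

# Rough best-case time to backfill a whole failed node over the network.
drive_size = 4 * TiB            # assumed NVMe capacity (illustrative)
fill_level = 0.5                # assumed average utilization (illustrative)
data_to_move = osds_per_node * drive_size * fill_level

link_gbit = 40                  # proposed per-node link from question 5
link_bytes_per_s = link_gbit / 8 * 1e9
hours = data_to_move / link_bytes_per_s / 3600
print(f"Data to backfill for one node: ~{data_to_move / TiB:.0f} TiB")
print(f"Best case over a {link_gbit} Gbit/s link: ~{hours:.1f} hours")

With these assumptions the node would need roughly 500 GiB of RAM rather
than 100+ GB, and refilling a single 48-drive node would take several
hours even if backfill could saturate the link, which in practice it
will not.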