Re: crushmap rules :: host selection

"Anthony D'Atri" <anthony.datri@xxxxxxxxx> · Mon, 29 Jan 2024 01:28:35 -0500

> so .. in a PG there are no "file data" but pieces of "file data"?

Yes.  Chapter 8 may help here, but be warned, it’s pretty dense and may confuse more than help.

The foundation layer of Ceph is RADOS — services including block (RBD), file (CephFS), and object (RGW) storage are built on top of it.

RADOS handles placing data and ensuring that configured replication is maintained.

RADOS stores chunks of data in RADOS objects, which are not to be confused with S3 (or Swift) objects.  Yes, the term is overloaded and confusing.  I disambiguate in the docs whenever I come across an unspecified reference.

> so 100 GB file with 2x replication will be placed in more than 2 PGs?

For sure.  RADOS objects are, if I recall correctly, at most 4 MiB in size.  I once found a 16 MiB RADOS object in a cluster; one of my colleagues must have done something rather outré to create it.

So that 100GB file (block volume, S3 object …) will be split into roughly 25,000 RADOS objects (100 GB divided by 4 MiB), spread across some number of PGs within the pool in question.  Each PG is independently placed on 2 OSDs in your case, and by default anti-affinity is enforced:  those 2 OSDs won’t be on the same host.  When you have a larger number of OSDs, PGs will live on different pairs of OSDs.  In your extremely small cluster, if you have only 3 hosts and do 3-way replication, every PG will live on every host, but distributed among the OSDs on each host.
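To put numbers on the splitting, here is a minimal Python sketch.  It assumes the 4 MiB object size I recalled above; PG_NUM and the object names are made up for illustration; and toy_pg_for() is a CRC32 stand-in, not the hash Ceph actually uses to map objects to PGs.

```python
import zlib
from collections import Counter

OBJECT_SIZE = 4 * 1024**2   # 4 MiB, the object size recalled above (assumption)
PG_NUM = 128                # hypothetical pg_num for the pool


def object_count(total_bytes, object_size=OBJECT_SIZE):
    """Number of RADOS objects needed to hold total_bytes (integer ceiling)."""
    return -(-total_bytes // object_size)


def toy_pg_for(object_name, pg_num=PG_NUM):
    """Toy object-to-PG mapping: CRC32 modulo pg_num (NOT Ceph's real hashing)."""
    return zlib.crc32(object_name.encode()) % pg_num


print(object_count(100 * 10**9))    # 100 GB  -> 23842 objects
print(object_count(100 * 1024**3))  # 100 GiB -> 25600 objects

# Spread 10,000 hypothetical object names over the PGs and summarize.
spread = Counter(toy_pg_for(f"volume.{i:016x}") for i in range(10_000))
print(len(spread), "PGs used;", min(spread.values()), "to",
      max(spread.values()), "objects per PG")
```

The point is only that one large file turns into thousands of small objects, and those objects scatter over many PGs rather than landing in one or two.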

> Is there some kind of strips or chunks that a given data conglomerate is split into
> and then _those_ are put in multiple PGs with the rule that they have to be in a Replicated=X ?

That sounds like a RADOS object.  But be clear that a PG generally has replicas (or shards) on multiple OSDs; it’s a one-to-many mapping.  A PG with the ID 1.ff might have replicas on, say, host1 (osd.3) and host2 (osd.11).  PG 1.11 might have replicas on host1 (osd.0) and host3 (osd.13).
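If it helps to see the one-to-many mapping with host anti-affinity, here is a toy sketch in Python.  It is emphatically not CRUSH:  it uses plain rendezvous (highest-random-weight) hashing over a hypothetical three-host layout with invented OSD ids, just to show that each PG deterministically gets one OSD on each of N distinct hosts.

```python
import hashlib

# Hypothetical layout: three hosts, OSD ids invented for illustration.
HOSTS = {
    "host1": [0, 1, 2, 3],
    "host2": [8, 9, 10, 11],
    "host3": [12, 13, 14, 15],
}


def _score(pgid, item):
    """Deterministic pseudo-random score for a (PG, item) pair."""
    digest = hashlib.sha256(f"{pgid}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def toy_place(pgid, replicas=2, hosts=HOSTS):
    """Pick `replicas` distinct hosts, then one OSD on each (NOT real CRUSH)."""
    if replicas > len(hosts):
        raise ValueError("not enough hosts for host-level anti-affinity")
    # Rank hosts by score and keep the top `replicas`: no host appears twice.
    ranked = sorted(hosts, key=lambda h: _score(pgid, h), reverse=True)
    chosen = ranked[:replicas]
    # Within each chosen host, take the highest-scoring OSD.
    return [(h, max(hosts[h], key=lambda o: _score(pgid, f"osd.{o}")))
            for h in chosen]


for pgid in ("1.ff", "1.11", "1.2a"):
    print(pgid, toy_place(pgid))

# With 3 hosts and 3 replicas, every PG ends up on every host.
print(toy_place("1.ff", replicas=3))
```

Different PG ids generally come out on different host and OSD combinations, and asking for more replicas than there are hosts fails outright, since the anti-affinity rule cannot be satisfied.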

>> Ceph has like 2000 “options” that can be set.  Most of them you don’t need to know about and should never touch.
> great! better to have many tunables and then just ask for usage then to have things hardcoded

Indeed.  And most or all of them are documented these days.  That wasn’t always the case ;)

There are a few things hardcoded, but nothing you need to worry about.  

> The 2nd case, and yes for RBD i plan to use nvme (i have the OS images for now but i know that i can convert and import them into rbd)

Be sure to use enterprise-grade drives.  Client (consumer, desktop) drives are a false economy.  They often have limited durability and are prone to cliffing, where rather than presenting sustained performance, at some point performance will drop substantially.  They also may lack power loss protection, so if a server/rack/DC loses power suddenly, data in flight may be lost.
