Re: Mix NVME's in a single cluster

> I reckon that balancing is by far the biggest issue you are
> likely to have because most Ceph releases (I do not know about
> Reef) have difficulty balancing across drives of different
> sizes even with configuration changes.

There were some bugs around the Firefly/Hammer era with failure domains that had very different aggregate weights, but given some forethought, I’m not aware of recent situations where CRUSH fails in this scenario.  The pg autoscaler and balancer module may have difficulty with a complex CRUSH topology, but I doubt the OP has one, and if so the JJ Balancer is reputed to work well.

Now, it is *ideal* to have an OSD monoculture, but Ceph does pretty well with heterogeneity.

Say one is doing 3x replication and has 3 failure domains; to keep it simple, we’ll say 3 hosts.  If those *host* CRUSH buckets have aggregate weights like 100TB, 100TB, and 200TB, then the usable capacity will be only 100TB of data (300TB of the 400TB raw) †, because CRUSH has to place one copy of the data on each host, and once the two smaller hosts are full, game over.

Now say the larger and smaller drives and thus OSDs are spread more or less evenly across the hosts, so that all three have ~133TB aggregate weights.  All raw capacity can be used.

This is one reason why it is advantageous, when feasible, to have at least one more failure domain than the replication policy demands, so that Ceph can do the right thing.  Say we have 4 hosts now at 100TB, 100TB, 100TB, and 150TB; Ceph will be able to use most or all of the raw capacity.  With 100TB, 100TB, 100TB, and 1000TB, however, that massive variance in failure domain weight would likely prevent much of the heaviest failure domain’s capacity from being used.
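
For anyone who wants to play with these numbers, here’s a minimal sketch of the idealized math.  It assumes placement can be balanced perfectly (CRUSH only approximates this) and just reuses the example weights above:

# Idealized usable data capacity for a replicated pool: the largest D such
# that every failure domain can absorb its share while holding at most one
# replica of any given piece of data, i.e. sum_i min(w_i, D) >= replicas * D
def usable_data_tb(weights_tb, replicas):
    lo, hi = 0.0, sum(weights_tb) / replicas
    for _ in range(60):                       # binary search to convergence
        mid = (lo + hi) / 2
        if sum(min(w, mid) for w in weights_tb) >= replicas * mid:
            lo = mid
        else:
            hi = mid
    return lo

print(usable_data_tb([100, 100, 200], 3))        # ~100 TB of data (300 TB raw used)
print(usable_data_tb([100, 100, 100, 150], 3))   # ~150 TB of data (all 450 TB raw used)
print(usable_data_tb([100, 100, 100, 1000], 3))  # ~150 TB of data; most of the 1000 TB host sits idle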

Note that the OP describes 10 hosts, only half populated with 15TB OSDs today.  With 10 hosts, the failure domain for CRUSH rules is most likely *host*, so adding 8x 30TB OSDs to each results in all failure domains being equal in weight.  Ceph will be able to use all of the raw capacity.
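
Back-of-the-envelope, assuming each host currently holds 8x 15TB OSDs (the OP didn’t give the exact count, so treat that as a placeholder):

# Hypothetical per-host weights before and after adding 8x 30TB OSDs per host.
hosts = 10
existing_per_host = 8 * 15          # TB, assumed current population
added_per_host = 8 * 30             # TB, the proposed 30TB OSDs
per_host = existing_per_host + added_per_host
print(per_host)                     # 360 TB per host, identical across all 10 hosts
print(hosts * per_host)             # 3600 TB raw; with equal failure domains, all of it is usable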

The larger OSDs (and thus their drives) will naturally receive approximately double the number of PGs compared to the smaller OSDs.  Thus those drives will be ~twice as busy.  With NVMe that probably isn’t an issue, especially if the hosts are PCIe Gen 4 or later and adequate RAM is available.  The 30TB SSDs almost certainly are Gen 4 or later.
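
To put a rough number on the ~2x, here’s a sketch; the pg_num and replica count are made-up illustrative values, not the OP’s:

# PG replicas land on OSDs roughly in proportion to CRUSH weight.
pg_num, replicas = 4096, 3
host_osds_tb = [15] * 8 + [30] * 8            # one example host
total_weight = sum(host_osds_tb) * 10         # 10 identical hosts
for size_tb in (15, 30):
    expected = pg_num * replicas * size_tb / total_weight
    print(f"{size_tb}TB OSD: ~{expected:.0f} PG replicas")
# ~51 vs ~102: the 30TB OSDs carry twice the PGs, hence roughly twice the data and I/O.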


> * Assign different CRUSH weights. This configuration change is
>  "supposed" to work.

Short-stroking?  Sure it’ll work, but you’d waste 1.2PB of raw capacity, so that isn’t a great solution.
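
The arithmetic behind that figure, again assuming 8x 30TB drives per host across the 10 hosts:

# Down-weighting each 30TB OSD to behave like a 15TB one strands half of it.
hosts, big_drives_per_host = 10, 8
stranded_tb = hosts * big_drives_per_host * (30 - 15)
print(stranded_tb, "TB stranded, i.e. ~1.2 PB of raw capacity")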

> * Assign the 30TB drives to a different class and use them for
>  new "pools".

Very possible, but probably not necessary unless, say, one of the drive models is TLC and the other is QLC, in which case one may wish to segregate the workloads with separate pools.
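
If one did want to go that route, the broad strokes look something like the below; the class, rule, and pool names are made up for illustration, and this is a sketch, not a recipe:

import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

# OSDs come up with an auto-detected class (e.g. "nvme"), which has to be
# cleared before a different one can be assigned.
ceph("osd", "crush", "rm-device-class", "osd.80")
ceph("osd", "crush", "set-device-class", "nvme-qlc", "osd.80")

# A replicated rule restricted to that class with a host failure domain,
# then point the pool at it.
ceph("osd", "crush", "rule", "create-replicated", "rep-qlc", "default", "host", "nvme-qlc")
ceph("osd", "pool", "set", "bulkpool", "crush_rule", "rep-qlc")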

> 
> * Split each 30TB drive into two OSDs. Not a good idea for HDDs
>  of course but these are low latency SSDs.

You could do that.  If the number of OSDs and hosts were very low, this might have a certain appeal - I’ve done that myself.  In the OP’s case, I think it wouldn’t accomplish much other than using more RAM.
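
For scale, the extra RAM is easy to estimate; this assumes the default osd_memory_target of roughly 4 GiB per OSD and the same 8x 30TB drives per host as above:

# Splitting each 30TB drive into two OSDs adds one extra OSD per drive.
hosts, big_drives_per_host = 10, 8
extra_osds_per_host = big_drives_per_host
extra_gib_per_host = extra_osds_per_host * 4          # ~4 GiB osd_memory_target each
print(extra_gib_per_host, "GiB more RAM per host,",
      hosts * extra_gib_per_host, "GiB cluster-wide")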


> The main other problem with large capacity OSDs is the size of
> PGs, which can become very large with the default targets
> numbers of PGs, and a previous commenter mentioned that.

There are enough failure domains here that this wouldn’t be a showstopper, especially if pg_num values and/or the autoscaler’s target are raised to something like 200-400 PGs per OSD.
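
Rough per-PG sizes on a full 30TB OSD at a few PG-per-OSD ratios, purely to illustrate why raising the target helps:

# Average size of a PG replica on a full 30TB OSD at various PG-per-OSD counts.
osd_tb = 30
for pgs_per_osd in (100, 200, 400):
    print(pgs_per_osd, "PGs/OSD ->", round(osd_tb * 1000 / pgs_per_osd), "GB per PG replica")
# 100 -> 300 GB, 200 -> 150 GB, 400 -> 75 GB: smaller PGs mean smaller
# units of backfill and recovery.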

> I think that the current configuration style where one sets the
> number of PGs rather than the size of PGs leads people astray.

Ceph places PGs based on CRUSH weight, so as long as pg_num for a given pool is a power of two, and there are a halfway decent number of OSDs — which in this case is true — the above strategies would seem roughly equivalent.
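
A quick illustration of the power-of-two point, modeling the hash folding Ceph uses to map objects to PGs (a sketch of the behavior, not Ceph’s actual code):

# Objects hash into pg_num buckets via a stable-mod fold; when pg_num is not
# a power of two, the not-yet-split PGs cover twice the hash range of the rest.
def hash_slices_per_pg(pg_num):
    next_pow2 = 1 << (pg_num - 1).bit_length()
    slices = [1] * pg_num
    for b in range(pg_num, next_pow2):        # hash values folded back down
        slices[b - next_pow2 // 2] += 1
    return slices

print(set(hash_slices_per_pg(4096)))   # {1}    -> all PGs roughly the same size
print(set(hash_slices_per_pg(3000)))   # {1, 2} -> some PGs hold ~2x the data of others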

> In general my impression is that current Ceph defaults and its
> very design (a single level of grouping: PGs) were meant to be
> used with OSDs at most 1TB in size and larger OSDs are anyhow
> not a good idea,

I don’t follow; I know of no intrinsic issue with larger OSDs.  Were one to mix, say, 122TB OSDs and 1TB OSDs, or even 30TB OSDs and 1TB OSDs, the imbalance could be detrimental to performance, and one would need to pay close attention to the aforementioned PG overdose guardrails (e.g. mon_max_pg_per_osd).
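
To make the extreme example concrete: if PGs follow weight, a sane PG count on the tiny OSDs implies an absurd one on the huge OSDs (the 250 below is a commonly seen mon_max_pg_per_osd default; adjust for your release):

# PGs follow CRUSH weight, so wildly mixed OSD sizes push the big OSDs far
# past per-OSD PG guardrails like mon_max_pg_per_osd (commonly 250).
small_tb, big_tb = 1, 122                 # the extreme mix mentioned above
pgs_on_small = 100                        # a reasonable count for the 1TB OSDs
pgs_on_big = pgs_on_small * big_tb / small_tb
print(pgs_on_big)                         # 12200 -- far beyond any guardrail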

> but of course there are many people who know
> better, and good luck to them.


 † modulo base 2 vs 10, backfill/full ratios, etc.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



