Re: Questions about the CRUSH details

"Anthony D'Atri" <aad@xxxxxxxxxxxxxx> · Thu, 25 Jan 2024 23:20:58 -0500

> 
>>> forth), so this is why "ceph df" will tell you a pool has X free
>>> space, where X is "smallest free space on the OSDs on which this pool
>>> lies, times the number of OSDs".

To be even more precise, this depends on the failure domain.  With the typical "rack" failure domain, say you use 3x replication and have exactly 3 racks: every rack then holds one copy of everything, so you'll be limited to the capacity of the smallest rack.  If you have more racks than replicas, though, you are less affected by racks that vary somewhat in CRUSH weight.
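
For anyone who wants to play with the rack arithmetic, here is a rough sketch.  It computes an idealized upper bound on stored data when each object keeps one copy per rack on distinct racks.  The function name and the rack sizes are made up for illustration, and CRUSH places data pseudorandomly by weight, so a real cluster may well fall short of this bound.

```python
def max_usable(rack_capacities, replicas):
    """Idealized upper bound on stored data (same unit as the capacities)
    when every object keeps `replicas` copies on distinct racks.

    Solves sum(min(c_i, D)) == replicas * D for D: no rack can hold more
    than one copy of each object, so a rack contributes at most D.
    """
    caps = sorted(rack_capacities, reverse=True)
    if replicas < 1 or len(caps) < replicas:
        return 0.0
    for k in range(replicas):
        # Assume the k largest racks are "oversized" and contribute only D each.
        d = sum(caps[k:]) / (replicas - k)
        if caps[k] <= d:
            return d
    return 0.0  # not reached: the last iteration always satisfies the check


print(max_usable([10, 10, 6], 3))      # 6.0  -> pinned to the smallest rack
print(max_usable([10, 10, 10, 6], 3))  # 12.0 -> the small rack hurts far less
```

With three racks the result is exactly the smallest rack; adding a fourth rack lets the copies spread out so the undersized rack stops being the hard limit.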

With respect to OSDs, the above is still true, which is one reason we have the balancer module.  Say your OSDs are on average 50% full but you have one that is 70% full.  The most-full outlier will limit the reported available space.

The available space for each pool is also a function of the data protection strategy (replication vs. EC, and the parameters of each) as well as the prevailing full ratio setting.
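
To put numbers on the last two paragraphs, here is a simplified model of how a per-pool available figure can be estimated.  It assumes new data lands on each OSD in proportion to its CRUSH weight and that the pool is "full" as soon as the first OSD reaches the full ratio.  The function name, the tuple layout, and the sample cluster are hypothetical; this is a sketch of the idea, not a copy of what Ceph does internally.

```python
def estimate_max_avail(osds, full_ratio, raw_used_rate):
    """Estimate how much more user data a pool can take.

    osds          -- list of (size, used, crush_weight) tuples, sizes in TB
    full_ratio    -- fraction of an OSD usable before it counts as full
    raw_used_rate -- raw bytes written per user byte:
                     3.0 for 3x replication, (k + m) / k for EC k+m
    """
    total_weight = sum(weight for _, _, weight in osds)
    projected = []
    for size, used, weight in osds:
        if weight <= 0:
            continue  # an OSD with zero weight receives no new data
        headroom = max(size * full_ratio - used, 0.0)
        share = weight / total_weight  # fraction of new raw data landing here
        projected.append(headroom / share)
    raw_avail = min(projected) if projected else 0.0
    return raw_avail / raw_used_rate


# Ten 10 TB OSDs: nine at 50% and one outlier at 70% (about 52% on average).
uneven = [(10.0, 5.0, 10.0)] * 9 + [(10.0, 7.0, 10.0)]
# The same data spread evenly, as the balancer would aim for.
balanced = [(10.0, 5.2, 10.0)] * 10

print(estimate_max_avail(uneven, 0.95, 3.0))          # 3x replication
print(estimate_max_avail(balanced, 0.95, 3.0))        # same data, balanced
print(estimate_max_avail(uneven, 0.95, (4 + 2) / 4))  # EC 4+2
```

In this model the single 70% OSD cuts the estimate from roughly 14.3 TB to roughly 8.3 TB at 3x replication, even though the total amount of stored data is identical.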

>>> Given the pseudorandom placement of
>>> objects to PGs, there is nothing to prevent you from having the worst
>>> luck ever and all the objects you create end up on the OSD with least
>>> free space.
>> 
>> This is why you need a decent amount of PGs, to not run into statistical
>> edge cases.
> 
> Yes, just take the experiment to someone with one PG only, then it can
> only fill one OSD. Someone with a pool with only 2 PGs could at the
> very best case only fill two and so on. If you have 100+ PGs per OSD,
> the chances for many files to end up only on a few PGs becomes very
> small.

Indeed, a healthy number of PG shards per OSD is important as well for this reason.  I use an analogy of filling a 55-gallon drum with sportsballs.  You can fit maybe two beach balls in there with a ton of air space, but you could fit thousands of ping-pong balls in there with a lot less air space.
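
The beach ball vs. ping-pong ball effect is easy to see in a toy simulation.  The sketch below drops equal-sized PG shards onto OSDs uniformly at random and reports how overloaded the busiest OSD is relative to the mean.  Uniform random placement is only a stand-in for CRUSH here, and the function names and parameters are made up for illustration.

```python
import random
from collections import Counter


def imbalance(num_osds, pgs_per_osd, rng):
    """Ratio of the busiest OSD's shard count to the mean shard count."""
    total_shards = num_osds * pgs_per_osd
    counts = Counter(rng.randrange(num_osds) for _ in range(total_shards))
    return max(counts.values()) / pgs_per_osd


def average_imbalance(num_osds, pgs_per_osd, trials=20, seed=1):
    """Average the imbalance over several trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(imbalance(num_osds, pgs_per_osd, rng) for _ in range(trials)) / trials


for pgs in (1, 8, 128):
    print(pgs, "PG shards per OSD ->", round(average_imbalance(10, pgs), 2))
```

A ratio of 1.0 would be a perfectly even spread.  With one shard per OSD the busiest OSD typically carries several times its fair share; with 100+ shards per OSD the ratio settles close to 1.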

Having a power of 2 number of PGs per pool also helps with uniform distribution -- the description of why this is the case is a bit abstruse so I'll spare the list, but enquiring minds can read chapter 8 ;)
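
For the impatient, one commonly cited part of the reason can be shown in a few lines.  Ceph folds an object's hash into a PG number with a "stable mod" rather than a plain modulo; the sketch below is a Python rendering of that function, with the helper names and the evenly spread fake hash values being my own assumptions.

```python
from collections import Counter


def pg_num_mask(pg_num):
    """Smallest (2**n - 1) mask that covers PG ids 0 .. pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x, b, bmask):
    """Fold hash x into the range [0, b) the way ceph_stable_mod does."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def objects_per_pg(pg_num, num_hashes=1600):
    """Count how many of num_hashes evenly spread hash values land in each PG."""
    mask = pg_num_mask(pg_num)
    return Counter(stable_mod(x, pg_num, mask) for x in range(num_hashes))


print(sorted(objects_per_pg(16).items()))  # every PG gets 100
print(sorted(objects_per_pg(12).items()))  # PGs 4-7 get 200, the rest 100
```

With 12 PGs, the hash values that would have mapped to the nonexistent PGs 12-15 are folded back onto PGs 4-7, so those four PGs hold twice as many objects as their neighbors.  With a power of 2 there is nothing to fold and every PG gets the same share.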

> and every client can't have a complete list of millions of objects in
> the cluster, so it does client-side computations.

This is one reason we have PGs -- so that there's a manageable number of things to juggle, while not being so few as to run into statistical and other imbalances.

> 
> -- 
> May the most significant bit of your life be positive.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx