Re: Global AVAIL vs Pool MAX AVAIL

Thanks Anthony,

Shortly after I made that post, I found a Server Fault post where someone had asked exactly the same question. The reply was: "The 'MAX AVAIL' column represents the amount of data that can be used before the first OSD becomes full. It takes into account the projected distribution of data across disks from the CRUSH map and uses the 'first OSD to fill up' as the target."
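That explanation also lines up with our own numbers. As a rough sanity check (this is just shell arithmetic on the figures from my original mail below, and it assumes the pool's size of 2 is the only multiplier involved, so treat it as a sketch rather than Ceph's exact calculation):

# if the data were perfectly balanced, usable space would be roughly raw avail / replicas
echo $(( 6840 / 2 ))     # ~3420G
# what MAX AVAIL implies instead: raw writes left before the first OSD hits full
echo $(( 498 * 2 ))      # ~996G

The gap between ~3420G and ~996G is essentially the imbalance made visible, which is why MAX AVAIL looks so much scarier than the global AVAIL.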

To answer your question, yes, we do have a rather unbalanced cluster, which is something I'm working on. When I saw these figures, I got scared that I had less time to work on it than I thought. There are about 10 pools in the cluster, but we primarily use one for almost all of our storage, and it has only 64 PGs and 1 replica across 20 OSDs. As data has grown, each PG in this pool now accounts for about 148GB, while the OSDs are only about 1.4TB each, so it's easy to see how the cluster has found itself so far out of balance.
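To put a number on how coarse that is (again just rough shell arithmetic, taking the size of 2 from my original mail, so it's an approximation):

# PG copies per OSD: 64 PGs x 2 copies spread over 20 OSDs
echo $(( 64 * 2 / 20 ))        # ~6
# each ~148G PG on a ~1433G OSD is roughly this percentage of the disk
echo $(( 148 * 100 / 1433 ))   # ~10

With only around 6 PG copies per OSD, and each PG worth roughly 10% of a disk, a single PG landing on one OSD rather than another swings that OSD's utilisation by about 10%, so it doesn't take much for the fullest OSDs to run well ahead of the average.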

Anyway, once I've added the OSDs and the data has rebalanced, I'm going to incrementally increase the PG count for this pool in stages, to reduce the amount of data per PG and (hopefully) balance out the data distribution better than it is today.
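Roughly what I have in mind is the below; the pool name is a placeholder and the step sizes are just illustrative, and on our release I believe pgp_num still has to be bumped alongside pg_num, so please shout if this looks wrong:

# bump in small steps and let each step settle before the next
ceph osd pool set <poolname> pg_num 128
ceph osd pool set <poolname> pgp_num 128
ceph -s   # wait for backfill/recovery to finish before the next increase
# then repeat towards the eventual target (256, 512, ...)

Once the new OSDs are in, I'll redo the sums for the final target; the usual ~100 PGs per OSD rule of thumb would put it somewhere around 20 * 100 / 2 = 1000, i.e. 1024, but I'll double-check that before committing to anything.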

This is one big learning process; I just wish I wasn't doing quite so much of that learning in production.



On Mon, 2021-01-11 at 15:58 -0800, Anthony D'Atri wrote:

Either you have multiple CRUSH roots or device classes, or you have unbalanced OSD utilization.  What version of Ceph?  Do you have any balancing enabled?


Do


ceph osd df | sort -nk8 | head

ceph osd df | sort -nk8 | tail


and I’ll bet you have OSDs that are way more full than others. The STDDEV value that ceph df reports is, I suspect, accordingly high.


On Jan 11, 2021, at 2:07 PM, Mark Johnson <markj@xxxxxxxxx> wrote:


Can someone please explain to me the difference between the Global "AVAIL" and the "MAX AVAIL" in the pools table when I do a "ceph df detail"? The reason I ask is that we have a total of 14 pools, but almost all of our data lives in one pool. A "ceph df detail" shows the following:


GLOBAL:

   SIZE       AVAIL     RAW USED     %RAW USED     OBJECTS

   28219G     6840G       19945G         70.68      36112k


But the POOLS table from the same output shows the MAX AVAIL for each pool as 498G, and the pool with all the data shows 9472G used with a %USED of 95.00. If it matters, the pool size is set to 2, so my guess is that the global available figure is raw, meaning I should still have approx. 3.4TB available, but that 95% used has me concerned. I'm going to be adding some OSDs soon, but I'd still like to understand the difference and how much trouble I'm in at this point.




_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



